Causality in a wide sense – Lecture III. Peter Bühlmann, Seminar for Statistics, ETH Zürich.



SLIDE 1

Causality – in a wide sense Lecture III

Peter Bühlmann

Seminar for Statistics, ETH Zürich

SLIDE 2

Recap from yesterday

◮ causality is giving a prediction to an intervention/manipulation
◮ observational data plus interventional data is much more informative than observational data alone
◮ the do-intervention model is simple and easy to understand, but often too specific: we often cannot intervene precisely at single variables

SLIDE 3

Some empirical “experience” with biological data: despite the success story in Maathuis, Colombo, Kalisch & PB (2010)

[Figure: true positives versus false positives for IDA, Lasso, Elastic-net and random guessing]

it seems very difficult to obtain “stable” estimation of graph equivalence classes from data
◮ the problem is much harder than fitting undirected Gaussian graphical models (which is essentially linear regression)

SLIDE 4

Methodological “thinking”
◮ inferring causal effects from observational data is very ambitious (perhaps “feasible in a stable manner” in applications with very large sample size)
◮ using interventional data is beneficial; this is what scientists have been doing all the time
❀ the agenda:
◮ exploit (observational-) interventional/perturbation data
◮ for unspecific interventions
◮ in the context of hidden confounding variables (Lecture IV)

SLIDE 5

“my vision”: do it without graph estimation

(but use graphs as a language to describe the aims)

SLIDE 6

Adversarial robustness in machine learning and generative networks (e.g. Ian Goodfellow), and causality (e.g. Judea Pearl):

do they have something “in common”?

SLIDE 7

Heterogeneous (potentially large-scale) data: we will take advantage of heterogeneity, often arising with large-scale data where the i.i.d./homogeneity assumption is not appropriate

SLIDE 8

It’s quite a common setting... data from different known observed environments, experimental conditions, perturbations or sub-populations e ∈ E:

(X^e, Y^e) ∼ F^e, e ∈ E

with response variables Y^e and predictor variables X^e

examples:
• data from 10 different countries
• data from different economic scenarios (from different “time blocks”), e.g. immigration in the UK

SLIDE 9

consider “many possible” but mostly non-observed environments/perturbations F ⊃ E (with E observed)

examples for F:
• the 10 countries and many others beyond those 10 countries
• scenarios until today and new unseen scenarios in the future, e.g. immigration in the UK in the unseen future

problem: predict Y given X such that the prediction works well (is “robust”) for “many possible” environments e ∈ F, based on data from the much fewer environments in E

SLIDE 10

trained on designed, known scenarios from E

SLIDE 11

trained on designed, known scenarios from E; a new scenario from F!

SLIDE 12

Personalized health: want to be robust across environmental factors

SLIDE 13

Personalized health: want to be robust across unseen environmental factors

SLIDE 14

a pragmatic prediction problem: predict Y given X such that the prediction works well (is “robust”) for “many possible” environments e ∈ F, based on data from much fewer environments in E

for example with linear models: find

argmin_β max_{e ∈ F} E|Y^e − (X^e)^T β|^2

it is “robustness”, and also about causality: remember, causality is predicting the answer to a “what if I do/perturb” question! that is: prediction for new unseen scenarios/environments
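The min-max criterion above can be illustrated numerically. The following is a small sketch of my own (toy data, not from the lecture): with two environments that perturb X differently, the coefficient vector supported on the causal variable attains a smaller worst-case risk than a fit that loads on a non-causal covariate.

```python
import numpy as np

def worst_case_risk(beta, envs):
    """Empirical version of max_e E|Y^e - (X^e)^T beta|^2
    over a finite collection of observed environments."""
    return max(np.mean((Y - X @ beta) ** 2) for X, Y in envs)

rng = np.random.default_rng(0)
envs = []
for scale in (1.0, 3.0):  # each environment perturbs the distribution of X
    X = scale * rng.normal(size=(5000, 2))
    Y = X @ np.array([1.0, 0.0]) + rng.normal(size=5000)  # only X_1 is causal
    envs.append((X, Y))

beta_causal = np.array([1.0, 0.0])
beta_other = np.array([1.0, 0.3])  # also loads on the non-causal X_2

risk_causal = worst_case_risk(beta_causal, envs)
risk_other = worst_case_risk(beta_other, envs)
```

The causal parameter keeps its risk close to Var(ε) in every environment, while the risk of `beta_other` grows with the strongest perturbation of X_2.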

SLIDE 19

Prediction and causality: indeed, for linear models, in a nutshell: for F = {all perturbations not acting on Y directly},

argmin_β max_{e ∈ F} E|Y^e − (X^e)^T β|^2 = causal parameter

that is: the causal parameter optimizes the worst-case loss w.r.t. “very many” unseen (“future”) scenarios
later: we will discuss models for F and E which make these relations more precise

SLIDE 21

How to exploit heterogeneity? for causality or “robust” prediction

Invariant causal prediction (Peters, PB and Meinshausen, 2016); a main simplifying message:

causal structure/components remain the same across different environments/perturbations, while non-causal components can change across environments

thus: ❀ look for “stability” of structures among different environments

SLIDE 23

Invariance: a key conceptual assumption

Invariance Assumption (w.r.t. E): there exists S* ⊆ {1, ..., d} such that

L(Y^e | X^e_{S*}) is invariant across e ∈ E

for the linear model setting: there exists a vector γ* with supp(γ*) = S* = {j; γ*_j ≠ 0} such that

∀ e ∈ E: Y^e = X^e γ* + ε^e, ε^e ⊥ X^e_{S*}

with ε^e ∼ F_ε the same for all e, while X^e has an arbitrary distribution, different across e

γ*, S* is interesting in its own right! namely the parameter and structure which remain invariant across experimental settings, or heterogeneous groups
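A small simulation may help see what the Invariance Assumption asserts (a toy SEM of my own choosing, not from the slides): the residuals of Y with respect to γ* on the invariant set S* have the same distribution in every environment, even though the distribution of X changes.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_env(n, shift):
    """Toy linear SEM: X1 <- shift + eps1, Y <- 2*X1 + eps, X2 <- Y + eps2.
    The environment only shifts X1; the equation for Y is untouched,
    so S* = {1} and gamma* = (2, 0)."""
    X1 = shift + rng.normal(size=n)
    Y = 2.0 * X1 + rng.normal(size=n)
    X2 = Y + rng.normal(size=n)
    return X1, X2, Y

moments = []
for shift in (0.0, 5.0):  # two environments
    X1, X2, Y = sample_env(20000, shift)
    resid = Y - 2.0 * X1  # residual w.r.t. gamma* on S*
    moments.append((resid.mean(), resid.var()))

# residual distribution w.r.t. the invariant set agrees across environments
(m0, v0), (m1, v1) = moments
```

Regressing instead on the child X2 would give residuals whose mean shifts with the environment, so {2} is not an invariant set here.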

SLIDE 25

Invariance Assumption: plausible to hold with real data

two-dimensional conditional distributions of observational (blue) and interventional (orange) data (no intervention at the displayed variables X, Y):
[Figure: left panel, seemingly no invariance of the conditional distribution; right panel, plausible invariance of the conditional distribution]
SLIDE 26

Invariance Assumption w.r.t. F, where F ⊃ E is much larger (E observed): now the set S* and the corresponding regression parameter γ* are invariant over a much larger class of environments than what we observe!
❀ γ*, S* is even more interesting in its own right, since it says something about unseen new environments!

SLIDE 27

Link to causality: mathematical formulation with structural equation models:

Y ← f(X_{pa(Y)}, ε), X_j ← f_j(X_{pa(j)}, ε_j) (j = 1, ..., p), with ε, ε_1, ..., ε_p independent

[Graph: Y and its parental variables among X5, X11, X10, X3, X8, X7, X2]

(direct) causal variables for Y: the parental variables of Y

SLIDE 29

Link to causality. problem: under what model for the environments/perturbations e can we have an interesting description of the invariant sets S*?

loosely speaking: assume that the perturbations e
◮ do not act directly on Y
◮ do not change the relation between X and Y
but may act arbitrarily on X (arbitrary shifts, scalings, etc.)

graphical description: E is random with realizations e
[Graph: E → X → Y, with Y not depending on E; a second graph adds a hidden variable H (the IV model: see Lecture IV)]

SLIDE 32

Link to causality: it is easy to derive the following.

Proposition. Assume:
• a structural equation model for (Y, X);
• a model F of perturbations: every e ∈ F
◮ does not act directly on Y
◮ does not change the relation between X and Y
but may act arbitrarily on X (arbitrary shifts, scalings, etc.)
Then: the causal variables pa(Y) satisfy the invariance assumption with respect to F

causal variables lead to invariance under arbitrarily strong perturbations from F as described above

SLIDE 33

Proposition. Assume:
• a structural equation model for (Y, X);
• a model F of perturbations: every e ∈ F
◮ does not act directly on Y
◮ does not change the relation between X and Y
but may act arbitrarily on X (arbitrary shifts, scalings, etc.)
Then: the causal variables pa(Y) satisfy the invariance assumption with respect to F

as a consequence, for linear structural equation models and F as above:

argmin_β max_{e ∈ F} E|Y^e − (X^e)^T β|^2 = β^0_{pa(Y)}, the causal parameter

if the perturbations in F were not arbitrarily strong ❀ the worst-case optimizer is different! (see later)

SLIDE 35

A real-world example and the assumptions
Y: growth rate of the plant; X: high-dimensional covariates of gene expressions; perturbations e: different gene knock-out experiments ❀ e changes the expressions of some components of X

it’s plausible that the perturbations e
◮ do not directly act on Y √
◮ do not change the relation between X and Y ?
but may act arbitrarily on X (arbitrary shifts, scalings, etc.)

SLIDE 37

Causality ⇐⇒ Invariance

we just argued: causal variables =⇒ invariance
known for a long time: Haavelmo (1943); Trygve Haavelmo, Nobel Prize in Economics 1989 (...; Goldberger, 1964; Aldrich, 1989; ...; Dawid and Didelez, 2010)

more novel: the reverse relation
causal structure, predictive robustness ⇐= invariance (Peters, PB & Meinshausen, 2016)

SLIDE 39

The search for invariance and causality (Peters, PB & Meinshausen, 2016)
causal structure/variables ⇐= invariance

[Graph: Y and covariates X5, X11, X10, X3, X8, X7, X2]

severe issues of identifiability!

one can perform a statistical test of whether a subset S of covariates satisfies the invariance assumption:

H0-InvA(E): L(Y^e | X^e_S) is invariant across e ∈ E (the observed environments)

in a linear model ❀ Chow (1960)
❀ this yields sets S_1, ..., S_k which are statistically compatible with the invariance assumption H0-InvA(E)

SLIDE 40

making it identifiable:

Ŝ(E) = ⋂ {S; S statistically compatible with H0-InvA(E), i.e. no rejection at significance level α}

Theorem (Peters, PB and Meinshausen, 2016): assume a structural equation model with
◮ a linear model for Y versus X, Gaussian errors
◮ every e ∈ E does not act directly on Y and does not change the relation between X and Y
Then:

P[Ŝ(E) ⊆ S_causal = pa(Y)] ≥ 1 − α

a confidence guarantee against false positive causal selection
ICP = Invariant Causal Prediction
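A crude sketch of the resulting procedure (my own simplified implementation, not the method of the paper: the invariance check below only compares residual means across two environments with a t-test, a stand-in for the Chow-type test; the toy data are hypothetical):

```python
import itertools
import numpy as np
from scipy import stats

def invariance_pvalue(X, Y, env, S):
    """Pooled OLS of Y on X_S (plus intercept), then test whether the
    residual means agree across the two environments."""
    XS = np.column_stack([np.ones(len(Y))] + [X[:, j] for j in S])
    beta, *_ = np.linalg.lstsq(XS, Y, rcond=None)
    resid = Y - XS @ beta
    return stats.ttest_ind(resid[env == 0], resid[env == 1]).pvalue

def icp(X, Y, env, alpha=0.01):
    """S_hat(E): intersect all subsets compatible with invariance."""
    d = X.shape[1]
    accepted = [set(S) for r in range(d + 1)
                for S in itertools.combinations(range(d), r)
                if invariance_pvalue(X, Y, env, list(S)) > alpha]
    return set.intersection(*accepted) if accepted else set()

# toy SEM: X1 -> Y -> X2, the environment shifts X1 only
rng = np.random.default_rng(2)
n = 2000
env = np.repeat([0, 1], n)
X1 = np.where(env == 1, 4.0, 0.0) + rng.normal(size=2 * n)
Y = X1 + rng.normal(size=2 * n)
X2 = Y + rng.normal(size=2 * n)
S_hat = icp(np.column_stack([X1, X2]), Y, env)
```

With high probability `S_hat` is a subset of the true parent set {0}, in line with the coverage guarantee of the theorem; the empty set and {1} get rejected because their residuals shift with the environment, while {0} and {0, 1} are accepted and intersected.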

SLIDE 42

Single gene deletion experiments in yeast

d = 6170 genes
response of interest: Y = expression of the first gene; “covariates” X = gene expressions from all other genes
then response of interest: Y = expression of the second gene; “covariates” X = gene expressions from all other genes; and so on
goal: infer/predict the effects of unseen/new single gene deletions on all other genes

SLIDE 43

Kemmeren et al. (2014): genome-wide mRNA expressions in yeast, d = 6170 genes
◮ n_obs = 160 “observational” samples of wild-types
◮ n_int = 1479 “interventional” samples, each corresponding to a single gene deletion strain
for our method: we use |E| = 2 (observational and interventional data)

training-test data splitting:
• training set: all observational and 2/3 of the interventional data
• test set: the other 1/3 of the gene deletion interventions
❀ we can validate the predicted effects of these interventions
• repeat this for the three blocks of interventional test data

multiplicity adjustment: since ICP is used 6170 times (once for every response variable), we use coverage 1 − α/6170 with α = 0.05

SLIDE 44

Results for inferring causal variables on a single training-test split: 8 genes are “significant” (at the α = 0.05 level) causal variables (each of the 8 genes “causes” one other gene)

not many findings... but we use a stringent criterion with Bonferroni-corrected level α/6170 = 0.05/6170 to control the familywise error rate

SLIDE 46

8 genes are “significant” (at the α = 0.05 level) causal variables
validation: thanks to the intervention experiments (in the test data) we can validate the method(s); we only consider true Strong Intervention Effects (SIEs)

SIE = the observed response value associated to an intervention is in the 1%- or 99%-tail of the observational data

6 out of the 8 “significant” genes are true SIEs!

SLIDE 48

[Figure: number of strong intervention effects versus number of intervention predictions for PERFECT, INVARIANT, HIDDEN-INVARIANT, PC, RFCI, REGRESSION (CV-Lasso), GES and GIES, with a 99% prediction interval for RANDOM]

I: invariant prediction method; H: invariant prediction with some hidden variables

SLIDE 49

Predicting a potential outcome: manipulate x = −8

[Figure: scatterplot of y versus x, both ranging over −10 to 10]

SLIDE 51

It’s an ambitious problem: manipulate x = −8

[Figure: scatterplot of y versus x, with the candidate causal directions between X and Y]

SLIDE 53

Invariance and novel robustness
◮ exact invariance and the corresponding causality may often be too ambitious
◮ the perturbations in future data might not be as strong as in the gene knock-out example
more pragmatic: construct “best” predictions in heterogeneous settings ❀ a novel robustness viewpoint (see Lecture IV)

SLIDE 54

The Causal Dantzig estimator (Rothenhäusler, PB & Meinshausen, 2019)

ICP (Invariant Causal Prediction)
◮ requires an all-subsets search
◮ does not allow for hidden confounding variables
◮ is rather general in terms of interventions/perturbations
we can have a methodology and algorithm which
◮ is computationally efficient (convex optimization)
◮ allows for hidden confounding
◮ is more restrictive w.r.t. interventions/perturbations
❀ the Causal Dantzig estimator/algorithm

SLIDE 55

instead of invariance of conditional distributions, require

Assumption (inner product invariance under β*):

E[X^e_j (Y^e − X^e β*)] = E[X^{e'}_j (Y^{e'} − X^{e'} β*)] ∀ e, e' ∈ E, ∀ j

Theorem: consider the linear SEM X ← BX + ε^0 with response Y = X_{p+1} = X^T β^causal + ε_Y. Inner product invariance holds under the causal coefficient vector β^causal if
◮ the interventions/environments do not act directly on Y
◮ the interventions are additive noise interventions: ε^e = ε^0 + δ^e with E[ε^0] = 0, Cov(ε^0, δ^e) = 0 and δ^e_Y ≡ 0

and the theorem extends to SEMs with measurement errors

SLIDE 56

ε^e = ε^0 + δ^e with E[ε^0] = 0, Cov(ε^0, δ^e) = 0 and δ^e_Y ≡ 0

ε^0 and δ^e can have dependent components ❀ hidden variables are covered
“reason”: with a hidden variable H,
Y ← Xβ + Hδ + ε_Y = Xβ + η_Y
X ← Hγ + ε_X = η_X
the η error terms are now dependent!

SLIDE 57

Causal Dantzig without regularization, for low-dimensional settings: consider two environments e = 1 and e' = 2 and the differences of Gram matrices

Ẑ = n_1^{-1} (X^1)^T Y^1 − n_2^{-1} (X^2)^T Y^2,
Ĝ = n_1^{-1} (X^1)^T X^1 − n_2^{-1} (X^2)^T X^2

under inner product invariance with β*: E[Ẑ − Ĝ β*] = 0
❀ β̂ = argmin_β ‖Ẑ − Ĝ β‖_∞

β̂ has an asymptotic Gaussian distribution with an explicit estimable covariance matrix Γ; if β^causal is non-identifiable, the covariance matrix Γ is singular in certain directions ❀ infinite marginal confidence intervals for the non-identifiable coefficients β^causal_k
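In the low-dimensional identifiable case the ‖·‖∞-minimizer is just the solution of Ĝβ = Ẑ. A self-contained numerical sketch (a one-covariate toy model of my own, with a hidden confounder H and a noise-scaling intervention acting on X only, not on Y):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_env(n, noise_scale):
    """Y <- 2*X + H + eps_Y with hidden confounder H; the environment
    only rescales the noise in X, never acting directly on Y."""
    H = rng.normal(size=n)
    X = H + noise_scale * rng.normal(size=n)
    Y = 2.0 * X + H + rng.normal(size=n)
    return X.reshape(-1, 1), Y

(X1, Y1), (X2, Y2) = sample_env(50000, 1.0), sample_env(50000, 3.0)

# differences of Gram matrices between the two environments
Z = X1.T @ Y1 / len(Y1) - X2.T @ Y2 / len(Y2)
G = X1.T @ X1 / len(Y1) - X2.T @ X2 / len(Y2)
beta_hat = np.linalg.solve(G, Z.reshape(-1, 1))  # causal Dantzig, low-dim

# pooled OLS, by contrast, is biased by the hidden confounder
Xp, Yp = np.vstack([X1, X2]), np.concatenate([Y1, Y2])
beta_ols = np.linalg.solve(Xp.T @ Xp, Xp.T @ Yp)
```

Here E[X^e(Y^e − X^e β*)] equals Var(H) in both environments, so the confounding term cancels in the Gram-matrix differences and `beta_hat` recovers the causal coefficient 2, while pooled regression picks up the confounding bias.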

SLIDE 58

Regularized Causal Dantzig:

β̂ = argmin_β ‖β‖_1 such that ‖Ẑ − Ĝ β‖_∞ ≤ λ

in analogy to the classical Dantzig selector (Candès & Tao, 2007), which uses Z̃ = n^{-1} X^T Y and G̃ = n^{-1} X^T X
using the machinery of high-dimensional statistics and assuming identifiability:

‖β̂ − β^causal‖_q ≤ O(s^{1/q} √(log(p)/min(n_1, n_2))) for q ≥ 1

SLIDE 59

various options to deal with more than two environments: e.g. all pairs and aggregation

SLIDE 60

Flow cytometry data (Sachs et al., 2005)
◮ p = 11 abundances of chemical reagents
◮ 8 different environments (not “well-defined” interventions): one of them observational, 7 with different reagents added
◮ each environment contains n_e ≈ 700−1000 samples
goal: recover the network of causal relations (linear SEM)

[Graph: network over Raf, Mek, PLCg, PIP2, PIP3, Erk, Akt, PKA, PKC, p38, JNK]

approach: “pairwise” invariant causal prediction (one variable is the response Y, the other 10 the covariates X; do this 11 times, with every variable once the response)

SLIDE 61

[Graph: estimated network over Raf, Mek, PLCg, PIP2, PIP3, Erk, Akt, PKA, PKC, p38, JNK]

blue edges: found only by the invariant causal prediction approach (ICP); red: only by ICP allowing for hidden variables and feedback; purple: both ICP with and without hidden variables; solid: relations that have been reported in the literature; broken: new findings not reported in the literature

❀ reasonable consensus with existing results, but no real ground truth is available; this serves as an illustration that we can work with “vaguely defined interventions”

SLIDE 62

Conclusions
◮ causality can be framed as worst-case risk optimization! more on that in Lecture IV
◮ causality can be inferred from invariance and a “stability” argument
◮ ICP (Invariant Causal Prediction) is a conceptual approach and method; Causal Dantzig is more powerful and “makes more statistical sense”, at the price of restricting the interventions

SLIDE 63

make heterogeneity or non-stationarity your friend

(rather than your enemy)!


SLIDE 65

References

◮ Bühlmann, P. (2018). Invariance, causality and robustness. To appear in Statistical Science. Preprint arXiv:1812.08233.
◮ Meinshausen, N., Hauser, A., Mooij, J.M., Peters, J., Versteeg, P. and Bühlmann, P. (2016). Methods for causal inference from gene perturbation experiments and validation. Proceedings of the National Academy of Sciences USA 113, 7361-7368.
◮ Peters, J., Bühlmann, P. and Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals (with discussion). Journal of the Royal Statistical Society, Series B 78, 947-1012.
◮ Pfister, N., Bühlmann, P. and Peters, J. (2018). Invariant causal prediction for sequential data. Journal of the American Statistical Association, published online, DOI 10.1080/01621459.2018.1491403.
◮ Rothenhäusler, D., Bühlmann, P. and Meinshausen, N. (2019). Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. Annals of Statistics 47, 1688-1722.
◮ Rothenhäusler, D., Meinshausen, N., Bühlmann, P. and Peters, J. (2018). Anchor regression: heterogeneous data meets causality. Preprint arXiv:1801.06229.

SLIDE 66

Robustness: remember:
◮ if the model is not correct, exhibiting e.g. nonlinearities ❀ loss of power, but controlling false positives is still OK
◮ if the Invariance Assumption does not hold ❀ loss of power, but controlling false positives is still OK
◮ hidden variables ❀ the method might pick up ancestors of Y, e.g. X2, which still exhibits a total intervention/causal effect (and hence is interesting for the gene perturbation experiments)
[Graph: X1, X2, X3, X4, Y with a hidden variable H]