

SLIDE 1

Causal Inference – Theory and Applications

  • Dr. Matthias Uflacker, Johannes Huegle, Christopher Schmidt

April 24, 2018

SLIDE 2

Agenda, April 24, 2018

  • Jupyter Notebook "Causal Inference in Application"
  • Recap: Causal Inference in a Nutshell
  • Introduction to Structural Causal Models
    1. Preliminaries
    2. Structural Causal Models
    3. (Local) Markov Condition
    4. Factorization
    5. Global Markov Condition
    6. Functional Model and Markov Conditions
    7. Faithfulness
    8. Constraint-based Causal Inference
    9. Markov Equivalence Class
    10. Summary
    11. Excursion: Maximal Ancestral Graphs

Uflacker, Huegle, Schmidt: Causal Inference – Theory and Applications, Slide 2

SLIDE 3

Jupyter Notebook “Causal Inference in Application”

SLIDE 4

Jupyter Notebook Causal Inference in Application

SLIDE 5

Jupyter Notebook Access Information

System procedure:

  • 1. Log in via LDAP (standard HPI credentials).
  • 2. Use the folder "Causal Inference – Theory and Applications".
  • 3. We provide a master notebook. Please use it as a read-only resource and copy relevant information into your local workspace.
  • 4. Your local workspace lives either in your home directory or in a separate folder within our course's folder.
  • 5. Let us know if you require new packages.

The link will be provided via email once we have the list of participants!

SLIDE 6

Causal Inference in a Nutshell

SLIDE 7

Causal Inference in a Nutshell – Recap: The Concept

Traditional statistical inference paradigm: observed data are governed by a joint distribution P, and inference targets aspects of P. E.g., what is the sailors' probability of recovery when we see a treatment with lemons? Q(P) = P(recovery | lemons).

Paradigm of structural causal models: observed data arise from a data-generating model G, and inference targets aspects of G. E.g., what is the sailors' probability of recovery if we do treat them with lemons? Q(G) = P(recovery | do(lemons)).

SLIDE 8

Introduction to Structural Causal Models

SLIDE 9

Introduction to Causal Graphical Models – Content

1. Preliminaries
2. Structural Causal Models
3. (Local) Markov Condition
4. Factorization
5. Global Markov Condition
6. Functional Model and Markov Conditions
7. Faithfulness
8. Constraint-based Causal Inference
9. Markov Equivalence Class
10. Summary
11. Excursion: Maximal Ancestral Graphs

SLIDE 10

1. Preliminaries – Notation

A, B           events
X, Y, Z        random variables
x              value of a random variable
Pr             probability measure
P_X            probability distribution of X
p              density
p_x or p_X     density of P_X
p(x)           density of P_X evaluated at the point x
X ⊥ Y          independence of X and Y
X ⊥ Y | Z      conditional independence of X and Y given Z

SLIDE 11

1. Preliminaries – Independence of Events

Two events A and B are called independent if
Pr(A ∩ B) = Pr(A) ⋅ Pr(B),
or, rewritten in conditional probabilities, if
Pr(A | B) = Pr(A ∩ B) / Pr(B) = Pr(A),  Pr(B | A) = Pr(A ∩ B) / Pr(A) = Pr(B).

Events A_1, …, A_n are called (mutually) independent if for every subset S ⊂ {1, …, n} we have
Pr(⋂_{i∈S} A_i) = ∏_{i∈S} Pr(A_i).

Note: for n ≥ 3, pairwise independence Pr(A_i ∩ A_j) = Pr(A_i) ⋅ Pr(A_j) for all i, j does not imply (mutual) independence.
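A minimal numeric check of the note above (my own illustration, not from the slides): three events built from two fair coin flips and their XOR are pairwise independent but not mutually independent.

```python
from itertools import product
from fractions import Fraction

# Sample space: two fair coin flips (b1, b2); the third bit is their XOR.
outcomes = [(b1, b2, b1 ^ b2) for b1, b2 in product([0, 1], repeat=2)]
pr = Fraction(1, 4)  # each of the 4 outcomes is equally likely

def prob(event):
    """Probability of the set of outcomes where `event` holds."""
    return sum(pr for w in outcomes if event(w))

A = [lambda w, i=i: w[i] == 1 for i in range(3)]  # A_i: "bit i equals 1"

# Pairwise independence: Pr(A_i ∩ A_j) = Pr(A_i) * Pr(A_j) = 1/4 for i != j
for i in range(3):
    for j in range(i + 1, 3):
        both = prob(lambda w: A[i](w) and A[j](w))
        assert both == prob(A[i]) * prob(A[j])

# Mutual independence fails: no outcome has all three bits set (1 XOR 1 = 0),
# so Pr(A_1 ∩ A_2 ∩ A_3) = 0, while the product of marginals is 1/8.
all_three = prob(lambda w: all(a(w) for a in A))
print(all_three, prob(A[0]) * prob(A[1]) * prob(A[2]))
```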

SLIDE 12

1. Preliminaries – Independence of Random Variables

Two real-valued random variables X and Y are called independent, X ⊥ Y, if for every x, y ∈ ℝ the events {X ≤ x} and {Y ≤ y} are independent; or, in terms of densities: for all x, y,
p(x, y) = p(x) p(y).

Note: If X ⊥ Y, then E[XY] = E[X] E[Y], and cov(X, Y) = E[XY] − E[X] E[Y] = 0. The converse is not true: cov(X, Y) = 0 does not imply X ⊥ Y. However, for a sufficiently large class ℱ of functions we have: if cov(g(X), h(Y)) = 0 for all g, h ∈ ℱ, then X ⊥ Y.

No correlation does not imply independence!
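The classic counterexample for the note above, worked exactly (my own illustration): Y = X² is a deterministic function of X, yet the two are uncorrelated.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X,
# so X and Y are clearly dependent -- yet their covariance is zero.
support = [-1, 0, 1]
p = Fraction(1, 3)

E_X  = sum(p * x for x in support)          # E[X]   = 0
E_Y  = sum(p * x**2 for x in support)       # E[Y]   = 2/3
E_XY = sum(p * x * x**2 for x in support)   # E[X^3] = 0
cov = E_XY - E_X * E_Y
print(cov)  # 0

# Dependence: P(X=0, Y=1) = 0, but P(X=0) * P(Y=1) = 1/3 * 2/3 = 2/9
p_joint = sum(p for x in support if x == 0 and x**2 == 1)
p_prod  = sum(p for x in support if x == 0) * sum(p for x in support if x**2 == 1)
print(p_joint, p_prod)
```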

SLIDE 13

1. Preliminaries – Conditional Independence of Random Variables

Two real-valued random variables X and Y are called conditionally independent given Z, written X ⊥ Y | Z or (X ⊥ Y | Z)_P, if
p(x, y | z) = p(x | z) p(y | z)
for all x, y and all z such that p(z) > 0.

Note: It is possible to find X, Y which are conditionally independent given a variable Z but unconditionally dependent, and vice versa.
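One direction of the note can be checked exactly with a common-cause construction (my own example, not from the slides): X and Y each copy a fair coin Z with independent noise, so X ⊥ Y | Z holds, yet X and Y are marginally dependent.

```python
from fractions import Fraction
from itertools import product

# Z is a fair coin; X and Y each copy Z but are independently flipped with
# probability 1/4. By construction X ⊥ Y | Z; marginally both track Z.
pz = {0: Fraction(1, 2), 1: Fraction(1, 2)}
pflip = Fraction(1, 4)

joint = {}  # joint[(x, y, z)] = probability
for z, fx, fy in product([0, 1], repeat=3):
    w = pz[z] * (pflip if fx else 1 - pflip) * (pflip if fy else 1 - pflip)
    key = (z ^ fx, z ^ fy, z)
    joint[key] = joint.get(key, 0) + w

def marg(pred):
    return sum(w for k, w in joint.items() if pred(*k))

# Conditional independence given Z: p(x, y | z) == p(x | z) p(y | z)
for x, y, z in product([0, 1], repeat=3):
    lhs = marg(lambda a, b, c: (a, b, c) == (x, y, z)) / pz[z]
    rhs = (marg(lambda a, b, c: (a, c) == (x, z)) / pz[z]) * \
          (marg(lambda a, b, c: (b, c) == (y, z)) / pz[z])
    assert lhs == rhs

# Marginal dependence: p(x=1, y=1) = 5/16, but p(x=1) p(y=1) = 1/4
pxy = marg(lambda a, b, c: (a, b) == (1, 1))
px = marg(lambda a, b, c: a == 1)
py = marg(lambda a, b, c: b == 1)
print(pxy, px * py)
```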

SLIDE 14

2. Structural Causal Models – Definition (Pearl)

■ Directed Acyclic Graph (DAG) G = (V, E)
  □ Vertices V_1, …, V_n
  □ Directed edges E, with (V_i, V_j) ∈ E written V_i → V_j
  □ No cycles
■ Kinship terminology is used; e.g., for the path V_i → V_j → V_k:
  □ V_i = Pa(V_j), the parent of V_j
  □ {V_i, V_j} = Anc(V_k), the ancestors of V_k
  □ {V_j, V_k} = Des(V_i), the descendants of V_i
■ Directed edges encode direct causes via
  □ V_j = f_j(Pa(V_j), N_j)
  with independent noise variables N_1, …, N_n

Cooling house example:
▪ V_1 = N(0,1)
▪ V_2 = N(0,1)
▪ V_3 = 3 V_2 + N(0,1)
▪ V_4 = 4 V_1 + 5 V_2 + 0.7 V_3 + N(0,1)
▪ V_5 = V_4 + N(0,1)
▪ V_6 = 1.2 V_4 + N(0,1)

This forms the causal graphical model with edges V_1 → V_4, V_2 → V_3, V_2 → V_4, V_3 → V_4, V_4 → V_5 and V_4 → V_6.

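The cooling house SCM can be simulated by drawing each variable from its structural equation in topological order. A short sketch (sample size and seed are my own choices):

```python
import numpy as np

# Simulation of the cooling house SCM; structural equations as on the slide.
rng = np.random.default_rng(0)
n = 100_000

v1 = rng.normal(0, 1, n)
v2 = rng.normal(0, 1, n)
v3 = 3 * v2 + rng.normal(0, 1, n)
v4 = 4 * v1 + 5 * v2 + 0.7 * v3 + rng.normal(0, 1, n)
v5 = v4 + rng.normal(0, 1, n)
v6 = 1.2 * v4 + rng.normal(0, 1, n)

# Each variable is a function of its parents plus independent noise, so
# regressing V6 on its only parent V4 recovers the structural coefficient
# 1.2 (up to sampling noise).
coef = np.cov(v6, v4)[0, 1] / np.var(v4)
print(round(coef, 2))  # close to 1.2
```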
SLIDE 15

2. Structural Causal Models – Connecting G and P

Basic assumption, Causal Sufficiency: all relevant variables are included in the DAG G.

Key postulate: the (Local) Markov Condition.

Essential mathematical concept: d-separation, which describes the conditional independences required by a causal DAG.

Together these connect the data-generating model G with the joint distribution P:
(X ⊥ Y | Z)_G ⇒ (X ⊥ Y | Z)_P

SLIDE 16

3. (Local) Markov Condition – Theorem

(Local) Markov Condition: V_j is statistically independent of its nondescendants, given its parents Pa(V_j), i.e.,
V_j ⊥ V \ Des(V_j) | Pa(V_j).

I.e., every information exchange with its nondescendants involves its parents.

Example (cooling house DAG):
V_6 ⊥ {V_1, V_2, V_3} | V_4
V_5 ⊥ {V_1, V_2, V_3} | V_4

SLIDE 17

3. (Local) Markov Condition – Supplement (Lauritzen 1996)

Assume V_n has no descendants; then its nondescendants are ND_n = {V_1, …, V_{n−1}}.

Thus the local Markov condition implies V_n ⊥ {V_1, …, V_{n−1}} | Pa(V_n).

Hence the general decomposition
p(v_1, …, v_n) = p(v_n | v_1, …, v_{n−1}) p(v_1, …, v_{n−1})
becomes
p(v_1, …, v_n) = p(v_n | Pa(v_n)) p(v_1, …, v_{n−1}).

Induction over n yields
p(v_1, …, v_n) = ∏_{i=1}^n p(v_i | Pa(v_i)).

I.e., the graph shows us how to factor the joint distribution P_V.

SLIDE 18

4. Factorization – Definition

Factorization:
p(v_1, …, v_n) = ∏_{i=1}^n p(v_i | Pa(v_i)).

I.e., the conditionals act as causal mechanisms generating statistical dependence.

Example (cooling house DAG):
p_V = p(v_1, …, v_6) = p(v_1) ⋅ p(v_2) ⋅ p(v_3 | v_2) ⋅ p(v_4 | v_1, v_2, v_3) ⋅ p(v_5 | v_4) ⋅ p(v_6 | v_4) = ∏_{i=1}^6 p(v_i | Pa(v_i)).

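To see that a joint defined by this factorization really satisfies the local Markov condition, here is a small exact self-check on a binary analogue of the cooling house DAG (same parent sets; the CPT numbers are invented for illustration). It verifies V_5 ⊥ V_1 | V_4 by enumeration.

```python
from fractions import Fraction
from itertools import product

F = Fraction

def bern(p, v):            # p(v) for a Bernoulli(p) variable, v in {0, 1}
    return p if v else 1 - p

def cpt4(v1, v2, v3):      # made-up P(V4=1 | v1, v2, v3)
    return F(1, 10) + F(2, 10) * v1 + F(3, 10) * v2 + F(2, 10) * v3

# Define the joint *by* the factorization p(v) = prod_i p(v_i | Pa(v_i)).
joint = {}
for v in product([0, 1], repeat=6):
    v1, v2, v3, v4, v5, v6 = v
    joint[v] = (bern(F(1, 2), v1) * bern(F(1, 2), v2)
                * bern(F(3, 10) + F(4, 10) * v2, v3)       # P(V3 | V2)
                * bern(cpt4(v1, v2, v3), v4)               # P(V4 | V1,V2,V3)
                * bern(F(2, 10) + F(6, 10) * v4, v5)       # P(V5 | V4)
                * bern(F(1, 10) + F(7, 10) * v4, v6))      # P(V6 | V4)

def marg(pred):
    return sum(w for v, w in joint.items() if pred(*v))

# Local Markov check: p(v5, v1 | v4) == p(v5 | v4) p(v1 | v4) for all values.
for a, b, c in product([0, 1], repeat=3):
    p4  = marg(lambda *v: v[3] == c)
    lhs = marg(lambda *v: (v[4], v[0], v[3]) == (a, b, c)) / p4
    rhs = (marg(lambda *v: (v[4], v[3]) == (a, c)) / p4) * \
          (marg(lambda *v: (v[0], v[3]) == (b, c)) / p4)
    assert lhs == rhs
print("V5 ⊥ V1 | V4 holds in the factorized joint")
```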
SLIDE 19

5. Global Markov Condition – d-Separation (Pearl 1988)

Path: a sequence of pairwise distinct vertices in which consecutive vertices are adjacent.

A path π is said to be blocked by a set S if
  • π contains a chain V_i → V_j → V_k or a fork V_i ← V_j → V_k such that the middle node V_j is in S, or
  • π contains a collider V_i → V_j ← V_k such that the middle node V_j is not in S and no descendant of V_j is in S.

d-separation: S is said to d-separate X and Y in the DAG G, written (X ⊥ Y | S)_G, if S blocks every path from a vertex in X to a vertex in Y.

SLIDE 20

5. Global Markov Condition – Examples of d-Separation

In the cooling house DAG:

▪ The path from V_1 to V_6 is blocked by V_4.
▪ V_1 and V_6 are d-separated by V_4.
▪ The path V_2 → V_3 → V_4 → V_6 is blocked by V_3, by V_4, or by both.
▪ But: V_2 and V_6 are d-separated only by V_4 or {V_3, V_4}, since the direct edge V_2 → V_4 leaves the path V_2 → V_4 → V_6 open unless V_4 is conditioned on.
▪ V_1 and V_2 are not blocked by V_4: conditioning on the collider V_4 opens the path V_1 → V_4 ← V_2.

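These examples can be checked mechanically. Below is a small d-separation tester (my own sketch, using the standard reduction rather than a path search): X ⊥ Y | S holds in a DAG iff X and Y are disconnected in the moralized ancestral graph of X ∪ Y ∪ S after removing S.

```python
# The cooling-house DAG as an adjacency list of children.
dag = {1: [4], 2: [3, 4], 3: [4], 4: [5, 6], 5: [], 6: []}

def parents(g, v):
    return [u for u in g if v in g[u]]

def ancestors(g, nodes):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(g, stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(g, x, y, s):
    keep = ancestors(g, {x, y} | set(s))
    # Moralize: connect each node to its parents and "marry" co-parents.
    adj = {v: set() for v in keep}
    for u in keep:
        ps = [p for p in parents(g, u) if p in keep]
        for p in ps:
            adj[u].add(p); adj[p].add(u)
        for a in ps:
            for b in ps:
                if a != b:
                    adj[a].add(b); adj[b].add(a)
    # Remove S, then check whether x can still reach y.
    seen, stack = {x} | set(s), [x]
    while stack:
        for n in adj[stack.pop()] - set(s):
            if n == y:
                return False
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return True

print(d_separated(dag, 1, 6, {4}))   # True:  V4 blocks every V1-V6 path
print(d_separated(dag, 2, 6, {3}))   # False: V3 alone leaves V2 → V4 → V6 open
print(d_separated(dag, 1, 2, {4}))   # False: conditioning on the collider V4
```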
SLIDE 21

5. Global Markov Condition – Theorem

Global Markov Condition: For all disjoint subsets of vertices X, Y and Z we have:
X, Y d-separated by Z ⇒ (X ⊥ Y | Z)_P.

I.e., d-separation in the data-generating model G implies conditional independence in the joint distribution P:
(X ⊥ Y | Z)_G ⇒ (X ⊥ Y | Z)_P.

SLIDE 22

6. Functional Model and Markov Conditions – Theorem (Lauritzen 1996, Pearl 2000)

Theorem: Subject to technical conditions, the following are equivalent:
  • Existence of a functional causal model G;
  • Local causal Markov condition: V_j is statistically independent of its nondescendants given its parents (i.e., every information exchange with its nondescendants involves its parents);
  • Global causal Markov condition: d-separation, which characterizes the set of independences implied by the local Markov condition;
  • Factorization: p(v_1, …, v_n) = ∏_{i=1}^n p(v_i | Pa(v_i)).

I.e., (X ⊥ Y | Z)_G ⇒ (X ⊥ Y | Z)_P.

SLIDE 23

7. Causal Faithfulness – The Key Postulate

Causal Faithfulness: p is called faithful relative to G if only those independencies hold that are implied by the Markov condition, i.e.,
(X ⊥ Y | Z)_G ⇐ (X ⊥ Y | Z)_P.

I.e., we assume that any population P produced by the causal graph G has exactly the independence relations obtained by applying d-separation to G. This seems like a hefty assumption, but it really isn't: it assumes that whatever independencies occur arise not from incredible coincidence but from structure, i.e., from the data-generating model G.

SLIDE 24

8. Constraint-based Causal Inference – Concept (Spirtes, Glymour, Scheines and Pearl)

Assumptions:
  • Causal Sufficiency
  • Global Markov Condition
  • Causal Faithfulness

Causal structure learning: accept only those DAGs G as causal hypotheses for which
(X ⊥ Y | Z)_G ⇔ (X ⊥ Y | Z)_P.

  • This defines the basis of constraint-based causal structure learning.
  • It identifies the causal DAG up to its Markov equivalence class (the DAGs that imply the same conditional independencies).

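A compact sketch of the constraint-based idea (my own simplification of the PC algorithm: the independence oracle is hard-coded for the collider X → Z ← Y, and conditioning sets range over all remaining variables rather than following full PC's adjacency-based schedule):

```python
from itertools import combinations

vars_ = ["X", "Y", "Z"]
# Perfect oracle for X → Z ← Y: the only independence is X ⊥ Y (given ∅).
independencies = {("X", "Y", frozenset())}

def indep(a, b, s):
    key = frozenset({a, b})
    return any(frozenset({u, v}) == key and s == c
               for u, v, c in independencies)

# 1. Start from the complete undirected graph; delete an edge A-B whenever
#    some conditioning set S makes A and B independent, recording S.
edges = {frozenset(p) for p in combinations(vars_, 2)}
sepset = {}
for a, b in combinations(vars_, 2):
    others = [v for v in vars_ if v not in (a, b)]
    for k in range(len(others) + 1):
        for s in map(frozenset, combinations(others, k)):
            if indep(a, b, s):
                edges.discard(frozenset({a, b}))
                sepset[frozenset({a, b})] = s
print(sorted(tuple(sorted(e)) for e in edges))  # [('X', 'Z'), ('Y', 'Z')]

# 2. Orient v-structures: for A - C - B with A, B non-adjacent, orient
#    A → C ← B iff C is NOT in the separating set of A and B.
for a, b in combinations(vars_, 2):
    if frozenset({a, b}) in edges:
        continue
    for c in vars_:
        if c in (a, b):
            continue
        if (frozenset({a, c}) in edges and frozenset({b, c}) in edges
                and c not in sepset.get(frozenset({a, b}), {c})):
            print(f"v-structure: {a} -> {c} <- {b}")
```

Run on this oracle, the skeleton X - Z - Y is recovered and the v-structure X → Z ← Y is oriented, since Z does not appear in the separating set of X and Y.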
SLIDE 25

9. Markov Equivalence Class – Theorem (Verma and Pearl)

Skeleton: the corresponding undirected graph.

v-structure: a substructure X → Y ← Z with no edge between X and Z.

Theorem: Two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures.

SLIDE 26

9. Markov Equivalence Class – Examples

  • The chain X → Y → Z, the chain X ← Y ← Z and the fork X ← Y → Z have the same skeleton and no v-structure; all three encode exactly X ⊥ Z | Y, so they are Markov equivalent.
  • Two DAGs over W, X, Y, Z with the same skeleton and the same v-structure at W (shown on the slide) are likewise Markov equivalent.

SLIDE 27

10. Summary – Causal Structural Models

  • Causal structures are formalized by a DAG (directed acyclic graph) G with random variables V_1, …, V_n as vertices.
  • Causal Sufficiency, Causal Faithfulness and the Markov Condition imply (X ⊥ Y | Z)_G ⇔ (X ⊥ Y | Z)_P.
  • The local Markov condition states that the density p(v_1, …, v_n) factorizes into p(v_1, …, v_n) = ∏_{i=1}^n p(v_i | Pa(v_i)).
  • The causal conditionals p(v_j | Pa(v_j)) represent causal mechanisms.

SLIDE 28

11. Excursion: Maximal Ancestral Graphs – Motivating Example

Suppose we are given a list of conditional independencies among X, Y, Z and W (shown on the slide). Which DAG could have generated these, and only these, independencies and dependencies? The independencies fix the pattern of dependencies (the skeleton) and force certain colliders, but there is no orientation of the X–Y edge that is consistent with the independencies.

SLIDE 29

11. Excursion: Maximal Ancestral Graphs – DAG Models and Marginalization

Let's include an additional variable U. The extended DAG model generates a probability distribution P_{X,Y,Z,W,U} in which the listed independencies hold. The marginal distribution P_{X,Y,Z,W} = ∫ P_{X,Y,Z,W,U} du must satisfy the same independencies. But: this marginal distribution cannot be faithfully generated by any DAG.

DAG models are not closed under marginalization!

SLIDE 30

11. Excursion: Maximal Ancestral Graphs – Ancestral Graphs (informally)

An Ancestral Graph (AG) is a graph containing both directed and bi-directed edges, where the bi-directed edges stand for latent variables.

m-separation: If S m-separates X and Y in an ancestral graph M, then X ⊥ Y | S in every density p that factorizes according to any DAG G that is represented by the AG M.

SLIDE 31

11. Excursion: Maximal Ancestral Graphs – DAGs vs. AGs

Advantages of AGs:
  • AGs can faithfully represent more probability distributions than DAGs.
  • AG models are closed under marginalization.
  • AGs can (implicitly) represent unobserved variables, which exist in many (possibly almost all) applications.

Disadvantages of AGs:
  • Parameterization is difficult in the general case.
  • Markov equivalence is difficult.

SLIDE 32

References

  • Pearl, J. (2009). Causal Inference in Statistics: An Overview. Statistics Surveys, 3:96-146.
  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
  • Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. The MIT Press.

SLIDE 33

Thank you for your attention!