From Statistical Transportability to Estimating the Effect of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

From Statistical Transportability to Estimating the Effect of Stochastic Interventions

Juan D. Correa and Elias Bareinboim

{j.d.correa, eliasb}@columbia.edu

slide-6
SLIDE 6

Generalization Challenges

  • One of the main tasks in ML is to learn/train models of an underlying process using data generated by the same process.

  • In fact, whenever enough data is provided, several approaches are currently capable of learning the underlying distribution very accurately.

  • In practice, however, the environment in which the data is collected is almost never the same as the one where the model is intended to be used and deployed.

  • Under these constraints, the performance of the model depends on the underlying structural similarities between the training and target environments.

slide-12
SLIDE 12

Statistical Transportability

Diagram: Current Website (𝚸), the training environment, and New Website (𝚸*), the target environment, each over X (type of ad), Y (bought), and W (age); the task is generalization from 𝚸 to 𝚸*.

We use selection nodes to represent differences in mechanism or distribution.

P(W) ≠ P*(W), hence P(y | x) ≠ P*(y | x)

slide-15
SLIDE 15

Statistical Transportability

  • How can we generalize the model learned in the source environment to different (but related) target environments?

  • Do we need to obtain samples from 𝚸* and train a new model?

Diagram: Current Website (𝚸), the training environment, and New Website (𝚸*), the target environment, each over X (type of ad), Y (bought), and W (age); the task is generalization from 𝚸 to 𝚸*.

slide-20
SLIDE 20

Statistical Transportability

Diagram: Current Website (𝚸), the training environment, and New Website (𝚸*), the target environment, each over X (type of ad), Y (bought), and W (age); the task is generalization from 𝚸 to 𝚸*.

We observe P(x,y,w). We want to say something about P*(y|x).

P(x,y,w) = P(w) P(x|w) P(y|x,w)

The factors P(x|w) and P(y|x,w) are the same in both environments, which is implied by this causal model.

slide-27
SLIDE 27

Statistical Transportability

  • The target distribution P*(y|x) can be expressed as:

$$P^*(y \mid x) = \frac{P^*(y, x)}{P^*(x)} = \frac{\sum_w P^*(y \mid x, w)\, P^*(x \mid w)\, P^*(w)}{\sum_w P^*(x \mid w)\, P^*(w)} = \frac{\sum_w P(y \mid x, w)\, P(x \mid w)\, P^*(w)}{\sum_w P(x \mid w)\, P^*(w)}$$

  • Under the assumptions implied by the diagram, only P*(w) needs to be measured in the target environment, while the other distributions can be reused from the data collected in the source environment.

Diagram: New Website (𝚸*), the target environment, over X (type of ad), Y (bought), and W (age).
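
As a minimal sketch of how the expression for P*(y|x) could be evaluated, the Python below plugs discrete probability tables into it. The function name, the table layout, and all numbers are hypothetical and chosen only for illustration; the sketch also assumes the denominator is positive for every x.

```python
def transport_conditional(p_y_given_xw, p_x_given_w, p_star_w):
    """Plug-in evaluation of
        P*(y|x) = sum_w P(y|x,w) P(x|w) P*(w) / sum_w P(x|w) P*(w).

    p_y_given_xw[w][x][y] and p_x_given_w[w][x] come from the source data;
    p_star_w[w] is the only quantity measured in the target.
    Returns a table result[x][y].
    """
    n_w = len(p_star_w)
    n_x = len(p_x_given_w[0])
    n_y = len(p_y_given_xw[0][0])
    result = []
    for x in range(n_x):
        # Denominator: P*(x) rebuilt from source P(x|w) and target P*(w).
        denom = sum(p_x_given_w[w][x] * p_star_w[w] for w in range(n_w))
        row = [
            sum(p_y_given_xw[w][x][y] * p_x_given_w[w][x] * p_star_w[w]
                for w in range(n_w)) / denom
            for y in range(n_y)
        ]
        result.append(row)
    return result


# Hypothetical tables: W = age group, X = type of ad, Y = bought (all binary).
p_x_given_w = [[0.7, 0.3], [0.4, 0.6]]      # indexed [w][x]
p_y_given_xw = [
    [[0.9, 0.1], [0.8, 0.2]],               # w = 0, indexed [x][y]
    [[0.6, 0.4], [0.5, 0.5]],               # w = 1
]
p_w_source = [0.8, 0.2]                     # P(w), current website
p_w_target = [0.3, 0.7]                     # P*(w), new website

source = transport_conditional(p_y_given_xw, p_x_given_w, p_w_source)
target = transport_conditional(p_y_given_xw, p_x_given_w, p_w_target)
print("P (y=1 | x=0) =", round(source[0][1], 4))
print("P*(y=1 | x=0) =", round(target[0][1], 4))
```

Passing the source P(w) recovers the source conditional P(y|x), so the two printed values differ only because the age distribution differs between the two websites.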

slide-33
SLIDE 33

Deciding Transportability

Inputs:

  • Selection Diagram D over the Source (𝚸) and the Target (𝚸*)

  • P(v): distribution learned from 𝚸

  • P*(w): partial distribution from 𝚸*

Question: is there a function f such that P*(y|x) = f(P(v), P*(w))?

Output: yes (f) / no

slide-40
SLIDE 40

Proposed Strategy

  1. Encode the assumptions about the differences and commonalities across environments. Tool: selection diagrams (with selection nodes).

  2. Identify the stable mechanisms across environments.

  3. Determine the variables that need to be re-measured.

  4. Construct an estimator from the available data.

Exploit causality theory.

slide-44
SLIDE 44

Results

  1. We introduce a novel graphical decomposition of the observed/learned distribution into factors that take into account the latent structure. It generalizes C-components (Tian & Pearl 2002) and is suitable for reasoning about distributions with different sets of measured variables.

  2. We derive a complete algorithm that determines whether a distribution P*(y|x) can be uniquely identified from distributions P(v) and P*(w) (W ⊆ V), based on the assumptions encoded in graphs corresponding to the source and target domains.

  3. We connect this problem with that of identifying the effect of stochastic plans, and show how the latter reduces to the former.

slide-52
SLIDE 52

Factorization of Observed Distributions

  • The Markov property leads to a natural factorization when all variables are observed, i.e.:

$$P(\mathbf{v}) = \prod_i P(v_i \mid pa_i) = P(x \mid u)\, P(z \mid x)\, P(y \mid z, u)\, P(u)$$

(where V is the set of all observable variables)

  • How to factorize the observed distribution in the presence of latent variables?

$$P(\mathbf{v}) = \sum_u P(x, z, y, u) = \sum_u P(x \mid u)\, P(z \mid x)\, P(y \mid z, u)\, P(u) = P(z \mid x) \Big( \sum_u P(x \mid u)\, P(y \mid z, u)\, P(u) \Big)$$

$$= P(x)\, P(z \mid x) \Big( \sum_{x'} P(y \mid z, x')\, P(x') \Big)$$

Causal inference tools give us the means to identify some factors involving latent variables from observed distributions.

Diagram: graph over X, Y, Z, U, with the factor Q[Y] highlighted.
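
The sketch below numerically checks two steps that can be verified directly from conditional tables: pulling P(z|x) out of the sum over u, and expressing a factor that involves the latent U through observed quantities only. It assumes binary variables, hypothetical numbers, and the dependencies read off the factorization P(x|u)P(z|x)P(y|z,u)P(u). Taking Q[Y] to denote Σ_u P(y|z,u)P(u) is an assumption about the slide's label, not something the slide states.

```python
from itertools import product

# Hypothetical tables for binary U, X, Z, Y (illustrative numbers only).
p_u = [0.6, 0.4]
p_x_given_u = [[0.8, 0.2], [0.3, 0.7]]            # indexed [u][x]
p_z_given_x = [[0.9, 0.1], [0.25, 0.75]]          # indexed [x][z]
p_y_given_zu = [[[0.7, 0.3], [0.4, 0.6]],         # indexed [z][u][y]
                [[0.5, 0.5], [0.1, 0.9]]]


def joint(x, z, y, u):
    """Full joint P(x, z, y, u) from the Markov factorization."""
    return (p_x_given_u[u][x] * p_z_given_x[x][z]
            * p_y_given_zu[z][u][y] * p_u[u])


# Observed distribution P(x, z, y): the latent U is summed out.
p_obs = {(x, z, y): sum(joint(x, z, y, u) for u in (0, 1))
         for x, z, y in product((0, 1), repeat=3)}


def factored(x, z, y):
    """P(z|x) * sum_u P(x|u) P(y|z,u) P(u)."""
    inner = sum(p_x_given_u[u][x] * p_y_given_zu[z][u][y] * p_u[u]
                for u in (0, 1))
    return p_z_given_x[x][z] * inner


def q_y_latent(y, z):
    """Assumed meaning of Q[Y]: sum_u P(y|z,u) P(u), which uses the latent U."""
    return sum(p_y_given_zu[z][u][y] * p_u[u] for u in (0, 1))


def q_y_observed(y, z):
    """sum_x' P(y|z,x') P(x'), computed from the observed P(x, z, y) only."""
    total = 0.0
    for xp in (0, 1):
        p_x = sum(p_obs[(xp, zz, yy)] for zz in (0, 1) for yy in (0, 1))
        p_xz = sum(p_obs[(xp, z, yy)] for yy in (0, 1))
        total += p_obs[(xp, z, y)] / p_xz * p_x
    return total


max_gap_factorization = max(abs(p_obs[k] - factored(*k)) for k in p_obs)
max_gap_q_y = max(abs(q_y_latent(y, z) - q_y_observed(y, z))
                  for y, z in product((0, 1), repeat=2))
print(max_gap_factorization, max_gap_q_y)
```

Both gaps should be zero up to floating-point error, which illustrates that this factor can be computed without ever observing U.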

slide-58
SLIDE 58

A slightly more complicated example

Diagrams: Source (𝚸), Target (𝚸*), and Needed factors (𝚸*), each over the variables Z, Y, F, X, A, B, D.

  • Suppose the inferential target is P*(y|x,z). After some algebra, one can show that given P(b,z,f,d,x,a,y) and P*(x,a), it can be written as

$$P^*(y \mid x, z) = \frac{\sum_{a,d} Q^*[A, X]\, Q[D]\, Q[Y]}{\sum_{a,d,y} Q^*[A, X]\, Q[D]\, Q[Y]}$$

$$P^*(y \mid x, z) = \sum_a P^*(a \mid x) \sum_d P(d \mid z) \sum_{z'} P(y \mid x, z', d, a)\, P(z')$$
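
A plug-in evaluation of the final expression for P*(y|x,z) could look like the sketch below. It takes the formula as given and does not derive it. All variables are assumed binary, and the function name, table layout, and numbers are hypothetical.

```python
def transported_example(p_star_a_given_x, p_d_given_z, p_y_given_xzda, p_z, x, z):
    """Evaluate
        P*(y|x,z) = sum_a P*(a|x) sum_d P(d|z) sum_z' P(y|x,z',d,a) P(z')
    for fixed x and z. Returns a list indexed by y.

    Only p_star_a_given_x[x][a] comes from the target; the other tables
    (p_d_given_z[z][d], p_y_given_xzda[x][z'][d][a][y], p_z[z']) come
    from the source.
    """
    n_a = len(p_star_a_given_x[x])
    n_d = len(p_d_given_z[z])
    n_z = len(p_z)
    n_y = len(p_y_given_xzda[0][0][0][0])
    out = []
    for y in range(n_y):
        total = 0.0
        for a in range(n_a):
            for d in range(n_d):
                # Innermost sum runs over z', independently of the conditioned z.
                inner = sum(p_y_given_xzda[x][zp][d][a][y] * p_z[zp]
                            for zp in range(n_z))
                total += p_star_a_given_x[x][a] * p_d_given_z[z][d] * inner
        out.append(total)
    return out


def p_y1(x, zp, d, a):
    """Hypothetical P(y=1 | x, z', d, a); stays within [0.1, 0.9]."""
    return 0.1 + 0.2 * x + 0.1 * zp + 0.3 * d + 0.2 * a


# Hypothetical tables (illustrative numbers only).
p_star_a_given_x = [[0.6, 0.4], [0.2, 0.8]]     # target, indexed [x][a]
p_d_given_z = [[0.7, 0.3], [0.5, 0.5]]          # source, indexed [z][d]
p_z = [0.55, 0.45]                              # source, indexed [z']
p_y_given_xzda = [[[[[1 - p_y1(x, zp, d, a), p_y1(x, zp, d, a)]
                     for a in (0, 1)]
                    for d in (0, 1)]
                   for zp in (0, 1)]
                  for x in (0, 1)]

result = transported_example(p_star_a_given_x, p_d_given_z,
                             p_y_given_xzda, p_z, x=1, z=0)
print(result)
```

Because every table on the right-hand side is a normalized conditional or marginal, the returned values sum to one over y, which is a quick sanity check on any estimate built this way.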

slide-63
SLIDE 63

Dynamic Plan Identification reduces to Statistical Transportability

Key observation. If the source environment corresponds to the current system, and the target environment corresponds to the source after an intervention, then transporting the distribution P*(y) is the same as identifying the effect of the intervention on an outcome Y.

Source diagram: X (tutoring), Y (GPA), W (previous GPA), Z (motivation). Students get tutoring of their own volition, based on their motivation.

Target diagram, after the intervention σX: X (tutoring), Y (GPA), W (previous GPA), Z (motivation). Intervention σX: assign tutoring only to students with low GPA.

P*(y) represents the effect of σX on Y.
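
To make the key observation concrete, the sketch below builds a fully specified toy model and computes the outcome distribution twice: once with the source mechanism for X, and once with that mechanism replaced by a policy σX. Every dependency and number here is an assumption made for illustration (in particular that Z and W are independent and that the policy reads the previous GPA W). The sketch only shows what P*(y) means; it does not address whether P*(y) is identifiable from observational data.

```python
from itertools import product

# Hypothetical toy model over binary variables (all numbers are illustrative).
p_z = [0.5, 0.5]                            # motivation: 0 = low, 1 = high
p_w = [0.3, 0.7]                            # previous GPA: 0 = low, 1 = high
p_x_given_z = [[0.9, 0.1], [0.3, 0.7]]      # source: tutoring driven by motivation, [z][x]
sigma_x_given_w = [[0.0, 1.0], [1.0, 0.0]]  # policy: tutor exactly the low-GPA students, [w][x]


def p_y1(x, w, z):
    """Assumed P(Y = high GPA | x, w, z); stays within [0.2, 0.9]."""
    return 0.2 + 0.2 * x + 0.3 * w + 0.2 * z


def outcome_distribution(x_mechanism):
    """Distribution of Y when X is generated by x_mechanism(x, w, z)."""
    p1 = sum(p_z[z] * p_w[w] * x_mechanism(x, w, z) * p_y1(x, w, z)
             for z, w, x in product((0, 1), repeat=3))
    return [1.0 - p1, p1]


# Source environment: students choose tutoring based on motivation.
source_p_y = outcome_distribution(lambda x, w, z: p_x_given_z[z][x])
# Target environment: same system, but X now follows the policy sigma_X.
target_p_y = outcome_distribution(lambda x, w, z: sigma_x_given_w[w][x])
print(source_p_y, target_p_y)
```

Only the mechanism for X changes between the two calls, which mirrors the idea that the target environment is the source after the intervention.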

slide-67
SLIDE 67

Conclusions

  • Leveraging causal inference tools, we solved the problem of generalizability of probability distributions across different but related environments.

  • We proposed a sound and complete procedure to decide whether a target distribution is transportable from observations in a source domain and partial measurements in the target domain, following the assumptions encoded in graphical models representing the data-generating process in the domains.

  • Leveraging these results, we solved the problem of identification of stochastic interventions.

slide-68
SLIDE 68

Thank you!
