SLIDE 1

Overview Causality Framework Structure Learning GraN-DAG & ext. Conclusion

Learning Causal Structures via Gradient-Based Optimization

Sébastien Lachapelle

Mila, Université de Montréal

March 4th, 2020

Sébastien Lachapelle Mila EAI Science Talk March 4th, 2020 1 / 40

SLIDE 2

Overview

Causality Framework
- Causal Graphical Models
- Motivating example
- Markov Equivalence and Structure Identifiability

Causal Structure Learning
- Problem formulation
- Discrete Search Algorithms
- Gradient-Based Algorithms

GraN-DAG & extensions
- The algorithm
- With interventional data
- Neural Autoregressive Flows

SLIDE 3

Causal graphical models (CGM)

- Random vector X ∈ R^d (d variables)
- Let G be a directed acyclic graph (DAG)
- Assume p(x) = ∏_{i=1}^d p(x_i | x_{π_i^G}), where π_i^G = parents of i in G
- Encodes (conditional) independence statements (via d-separation, see [Koller & Friedman, 2009])
- Almost identical to Bayesian networks, but allows for interventional distributions: p(x | do(z))

Simple example: G = (V, E), the chain X → Z → Y

p(x, y, z) = p(x) p(z | x) p(y | z)  ⟹  p(x, y | z) = p(x | z) p(y | z), i.e. X ⊥ Y | Z

The do operator will be explained in the following example...
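The chain factorization can be checked numerically. A minimal sketch (the linear-Gaussian mechanisms are an assumption, not specified on the slide): sample X → Z → Y ancestrally, then verify that X and Y are marginally dependent but independent given Z, using the partial correlation, which is exactly zero in the linear case.

```python
import numpy as np

# Ancestral sampling of the chain X -> Z -> Y with linear-Gaussian mechanisms.
rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
z = x + rng.standard_normal(n)   # p(z | x)
y = z + rng.standard_normal(n)   # p(y | z)

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_zy = np.corrcoef(z, y)[0, 1]
# partial correlation of X and Y given Z
pc = (r_xy - r_xz * r_zy) / np.sqrt((1 - r_xz**2) * (1 - r_zy**2))

print(abs(r_xy) > 0.3)   # True: marginally dependent
print(abs(pc) < 0.05)    # True: independent given Z (d-separated by Z)
```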

SLIDE 4

Why should you care: Kidney Stone Treatment

- T = Treatment ∈ {A, B}
- Z = Stone size ∈ {small, large}
- R = Patient recovered ∈ {0, 1}

(Example taken from Elements of Causal Inference by Peters et al., p. 111)

SLIDE 6

Why should you care: Kidney Stone Treatment

Pay attention to these two questions, assuming the size of your stone is unknown:
- What is your chance of recovery knowing that the doctor gave you treatment A?
- What is your chance of recovery if you decide to take treatment A?

SLIDE 8

Why should you care: Kidney Stone Treatment

T = Treatment ∈ {A, B}, Z = Stone size ∈ {small, large}, R = Patient recovered ∈ {0, 1}

What is your chance of recovery knowing that the doctor gave you treatment A?
- Knowing that your doctor gave you treatment A tells you that you probably have a large kidney stone: P(Z = large | T = A) = 0.75 ...
- ... which reduces your chance of recovery: P(R = 1 | T = A, Z = large) = 0.73 < 0.93 = P(R = 1 | T = A, Z = small)

What is your chance of recovery if you decide to take treatment A?
- You really don't know anything about your kidney stone
- Your taking treatment A is not a function of any variable

SLIDE 11

Why should you care: Kidney Stone Treatment

T = Treatment ∈ {A, B}, Z = Stone size ∈ {small, large}, R = Patient recovered ∈ {0, 1}

What is your chance of recovery knowing that the doctor gave you treatment A?
- P(R = 1 | T = A) = 0.78
- P(R = 1 | T = B) = 0.83

What is your chance of recovery if you decide to take treatment A?
- P(R = 1 | do(T = A)) = 0.832
- P(R = 1 | do(T = B)) = 0.782

But how do we compute these interventional distributions?!

SLIDE 12

Why should you care: Kidney Stone Treatment

T = Treatment ∈ {A, B}, Z = Stone size ∈ {small, large}, R = Patient recovered ∈ {0, 1}

P(R, Z | do(T = A)) = P(R | Z, T = A) P(Z)

The factor P(T = A | Z) is gone: the decision of taking treatment A does not depend on Z anymore.

Then simply marginalize as usual:

P(R = 1 | do(T = A)) = Σ_Z P(R = 1, Z | do(T = A)) = Σ_Z P(R = 1 | Z, T = A) P(Z) = 0.832
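The adjustment above is a one-liner in code. A minimal sketch: the conditional recovery rates and P(Z | T = A) come from the slides; the stone-size marginal P(Z) and the treatment-B conditionals are assumptions chosen to round out the example (they reproduce the quoted 0.78 / 0.832 numbers).

```python
# Backdoor adjustment for the kidney-stone example (graph: Z -> T, Z -> R, T -> R).
p_r_given_tz = {("A", "small"): 0.93, ("A", "large"): 0.73,   # from the slides
                ("B", "small"): 0.87, ("B", "large"): 0.69}   # assumed
p_z = {"small": 0.51, "large": 0.49}                          # assumed marginal
p_z_given_t = {"A": {"small": 0.25, "large": 0.75},           # P(Z=large|T=A)=0.75
               "B": {"small": 0.77, "large": 0.23}}           # assumed

def p_recover_obs(t):
    # P(R=1 | T=t): observing the treatment is evidence about the stone size
    return sum(p_r_given_tz[(t, z)] * p_z_given_t[t][z] for z in p_z)

def p_recover_do(t):
    # P(R=1 | do(T=t)): the intervention cuts Z -> T, so marginalize over P(Z)
    return sum(p_r_given_tz[(t, z)] * p_z[z] for z in p_z)

print(round(p_recover_obs("A"), 2))  # 0.78
print(round(p_recover_do("A"), 3))   # 0.832
```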

SLIDE 14

Structure Learning

In the kidney stone example, the causal graph was known. What if we don't have it? Learn it!

Purely observational data:

          X1     X2     X3
sample 1  1.76   10.46  0.002
sample 2  3.42   78.6   0.011
...
sample n  4.56   9.35   1.96

Is it even possible?

SLIDE 16

Identifiability

In general, this is impossible without interventional data...

Multiple DAGs can express the same distribution...

SLIDE 18

Identifiability

If we assume causal mechanisms are "simple", then G can be identified...

An example (useful later!): if the data follow this model...

X_i | X_{π_i^G} ∼ N(f_i(X_{π_i^G}), σ_i²)

...then the correct causal DAG G can be identified from purely observational data (see [Peters et al., 2014] for proof and regularity conditions)
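Data from this additive-noise model is easy to simulate. A minimal sketch for a hypothetical chain X1 → X2 → X3, with tanh standing in for the nonlinear f_i (the referenced work samples f_i from a Gaussian Process; tanh is an assumption here):

```python
import numpy as np

# Sample from X_i | X_pa(i) ~ N(f_i(X_pa(i)), sigma_i^2) on the chain X1 -> X2 -> X3.
rng = np.random.default_rng(0)

def sample_anm(n):
    x1 = rng.standard_normal(n)
    x2 = np.tanh(2 * x1) + 0.3 * rng.standard_normal(n)  # nonlinear f_2 + noise
    x3 = np.tanh(2 * x2) + 0.3 * rng.standard_normal(n)  # nonlinear f_3 + noise
    return np.stack([x1, x2, x3], axis=1)

X = sample_anm(5000)
print(X.shape)  # (5000, 3)
```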

SLIDE 19

Structure Learning

          X1     X2     X3
sample 1  1.76   10.46  0.002
sample 2  3.42   78.6   0.011
...
sample n  4.56   9.35   1.96

Score-based algorithms:

Ĝ = argmax_{G ∈ DAG} Score(G)

Often, Score(G) = regularized maximum likelihood under G
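The argmax above ranges over a combinatorially huge set, which is why search strategy matters. A brute-force count (plain enumeration, not from the slides) shows how fast the space of labeled DAGs grows, using the fact that a directed graph is acyclic iff its adjacency matrix is nilpotent (A^d = 0):

```python
import itertools
import numpy as np

def count_dags(d):
    """Count labeled DAGs on d nodes by enumerating all off-diagonal 0/1 matrices."""
    off_diag = [(i, j) for i in range(d) for j in range(d) if i != j]
    count = 0
    for bits in itertools.product([0, 1], repeat=len(off_diag)):
        A = np.zeros((d, d), dtype=int)
        for (i, j), b in zip(off_diag, bits):
            A[i, j] = b
        if not np.linalg.matrix_power(A, d).any():  # nilpotent <=> acyclic
            count += 1
    return count

print([count_dags(d) for d in range(1, 5)])  # [1, 3, 25, 543]
```

The count is super-exponential in d, so discrete algorithms like GES resort to greedy local moves rather than exhaustive search.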

SLIDE 20

Structure Learning

Taxonomy of score-based algorithms (non-exhaustive):

            Discrete optim.              Continuous optim.
Linear      GES [Chickering, 2003]       NOTEARS [Zheng et al., 2018]
Nonlinear   CAM [Bühlmann et al., 2014]  GraN-DAG [Lachapelle et al., 2020]

SLIDE 22

A greedy algorithm - CAM [Bühlmann et al., 2014]

Figures from [Bühlmann et al., 2014]

SLIDE 26

NOTEARS: Continuous optimization for structure learning

Encode the graph as a weighted adjacency matrix U = [u_1 | ... | u_d] ∈ R^{d×d}

(Slide shows a 3×3 example: a binary adjacency matrix A with three entries equal to 1, next to a weighted adjacency matrix U with entries 4.8, 0.2, −1.7 in the corresponding positions.)

U represents the coefficients of a linear model: X_i := u_i^T X + noise_i, ∀i

For an arbitrary U, the associated graph might be cyclic → we need an acyclicity constraint.

NOTEARS [Zheng et al., 2018] uses this differentiable acyclicity constraint:

Tr e^{U⊙U} − d = 0,  where e^M ≜ Σ_{k=0}^∞ M^k / k!
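The linear model above can be sampled in closed form: writing the model as X = XU + E gives X = E(I − U)^{-1} when U is acyclic. A minimal sketch with a hypothetical 3-node chain (the weights 2.0 and −1.5 are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_sem(U, n, noise_std=1.0):
    """Sample n rows of X_i := u_i^T X + noise_i by solving X(I - U) = E."""
    d = U.shape[0]
    E = noise_std * rng.standard_normal((n, d))
    return E @ np.linalg.inv(np.eye(d) - U)  # valid because U is acyclic

U = np.array([[0.,  2.,  0. ],
              [0.,  0., -1.5],
              [0.,  0.,  0. ]])  # hypothetical chain X1 -> X2 -> X3
X = sample_linear_sem(U, 1000)
print(X.shape)  # (1000, 3)
```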

SLIDE 28

NOTEARS: Continuous optimization for structure learning

NOTEARS [Zheng et al., 2018]: Solve this continuous constrained optimization problem:

max_U  −‖X − XU‖_F² − λ‖U‖_1   (Score)
s.t.   Tr e^{U⊙U} − d = 0

where X ∈ R^{n×d} is the design matrix containing all n samples.

Solve approximately using an Augmented Lagrangian method. Amounts to maximizing (with gradient ascent)

−‖X − XU‖_F² − λ‖U‖_1 − α_t (Tr e^{U⊙U} − d) − (μ_t / 2)(Tr e^{U⊙U} − d)²

while gradually increasing α_t and μ_t.
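The augmented Lagrangian above is straightforward to differentiate: the gradient of Tr e^{U⊙U} with respect to U is (e^{U⊙U})^T ⊙ 2U. A sketch (not the reference implementation) that evaluates the objective and its (sub)gradient and checks them against finite differences; λ, α_t, μ_t values are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

def h(U):
    return np.trace(expm(U * U)) - U.shape[0]      # Tr e^{U.U} - d

def grad_h(U):
    return expm(U * U).T * (2 * U)                 # d/dU Tr e^{U.U}

def aug_lagrangian(U, X, lam, alpha, mu):
    score = -np.sum((X - X @ U) ** 2) - lam * np.sum(np.abs(U))
    return score - alpha * h(U) - 0.5 * mu * h(U) ** 2

def grad_aug_lagrangian(U, X, lam, alpha, mu):
    g = 2 * X.T @ (X - X @ U)                      # grad of -||X - XU||_F^2
    g -= lam * np.sign(U)                          # subgradient of -lam*||U||_1
    g -= (alpha + mu * h(U)) * grad_h(U)           # penalty terms
    return g

def num_grad(f, U, eps=1e-6):                      # central finite differences
    G = np.zeros_like(U)
    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            Up, Um = U.copy(), U.copy()
            Up[i, j] += eps
            Um[i, j] -= eps
            G[i, j] = (f(Up) - f(Um)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
U = 0.3 * rng.standard_normal((3, 3))
ana = grad_aug_lagrangian(U, X, 0.1, 1.0, 2.0)
num = num_grad(lambda V: aug_lagrangian(V, X, 0.1, 1.0, 2.0), U)
print(np.allclose(ana, num, atol=1e-4))  # True
```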

SLIDE 35

NOTEARS: The acyclicity constraint

Tr e^{U⊙U} − d = 0,  where e^M ≜ Σ_{k=0}^∞ M^k / k!

Suppose A ∈ {0, 1}^{d×d} is an adjacency matrix for a certain directed graph.

(A^k)_{ii} = number of cycles of length k passing through i

Graph acyclic ⟺ (A^k)_{ii} = 0 for all i and all k
             ⟺ Tr Σ_{k=1}^∞ A^k / k! = 0   (all terms are nonnegative, so the sum vanishes iff each term does)
             ⟺ Tr (Σ_{k=0}^∞ A^k / k! − A^0) = 0
             ⟺ Tr e^A − d = 0

The argument is almost identical when using the weighted adjacency U instead of A...
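The derivation above can be sketched in a few lines, using scipy's matrix exponential; the 3-node graphs are illustrative:

```python
import numpy as np
from scipy.linalg import expm

def h(U):
    """NOTEARS acyclicity function: zero iff the graph of U is acyclic."""
    d = U.shape[0]
    return np.trace(expm(U * U)) - d   # U*U = elementwise square (U . U)

dag = np.array([[0., 1., 0.],   # 1 -> 2 -> 3: acyclic
                [0., 0., 1.],
                [0., 0., 0.]])
cyc = np.array([[0., 1., 0.],   # 1 -> 2 -> 3 -> 1: a 3-cycle
                [0., 0., 1.],
                [1., 0., 0.]])

print(h(dag))  # ~ 0
print(h(cyc))  # > 0 (every diagonal of A^3, A^6, ... contributes)
```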

SLIDE 38

Gradient-Based Neural DAG Learning

φ(i) ≜ {W^(1)_(i), ..., W^(L+1)_(i)},  where W^(ℓ)_(i) is the ℓ-th weight matrix of NN_φ(i)

φ ≜ {φ(i)}_{i=1}^d

∏_{i=1}^d p(x_i | x_{−i}; θ(i)) does not decompose according to a DAG!

We need to constrain the networks to be acyclic! How?

SLIDE 39

Gradient-Based Neural DAG Learning

Key idea: Construct a weighted adjacency matrix A_φ (analogous to U from the linear case) which can be used in the acyclicity constraint.

Then maximize likelihood under the acyclicity constraint via an augmented Lagrangian:

max_φ  E_{X∼P_X} Σ_{i=1}^d log p_φ(X_i | X_{−i}) − α_t (Tr e^{A_φ} − d) − (μ_t / 2)(Tr e^{A_φ} − d)²   (Augmented Lagrangian)

SLIDE 44

Constructing weighted adjacency matrix Aφ

Let's measure the "strength" of edge X_j → X_i.

Path product: |W^(1)_{h1 j}| · |W^(2)_{h2 h1}| · |W^(3)_{k h2}| ≥ 0

C ≜ |W^(3)| |W^(2)| |W^(1)|

"Connection strength" from X_j to θ(i):  Σ_{k=1}^m C_{kj} ≥ 0

Σ_{k=1}^m C_{kj} = 0 ⇒ all paths from X_j to X_i are inactive!

(A_φ)_{ji} ≜ Σ_{k=1}^m C^(i)_{kj} if i ≠ j, and 0 otherwise
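The construction above reduces to products of absolute weight matrices. A minimal sketch for two nodes with 3-layer-free toy networks (the weight values are made up; in GraN-DAG each NN's own input is additionally masked):

```python
import numpy as np

def connection_strength(weights):
    """Sum over output units k of C = |W_L| ... |W_1|, per input j."""
    C = np.abs(weights[0])
    for W in weights[1:]:
        C = np.abs(W) @ C          # accumulate |W_l| products along paths
    return C.sum(axis=0)           # strength of each input j

def build_A(all_weights):
    """all_weights[i] = list of weight matrices of the NN predicting X_i."""
    d = len(all_weights)
    A = np.zeros((d, d))
    for i, weights in enumerate(all_weights):
        s = connection_strength(weights)
        for j in range(d):
            if j != i:
                A[j, i] = s[j]     # (A_phi)_{ji}: strength of edge X_j -> X_i
    return A

nn0 = [np.zeros((2, 2)), np.array([[1.0, 2.0]])]            # X0 uses no input
nn1 = [np.array([[0.7, 0.], [0.2, 0.]]), np.array([[1.0, 1.0]])]  # X1 uses X0
A = build_A([nn0, nn1])
print(A)  # only the X0 -> X1 entry is nonzero (0.7 + 0.2 = 0.9)
```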

SLIDE 45

Gradient-Based Neural DAG Learning

The algorithm:

1. Preliminary neighborhood selection (analogous to CAM): for each node, select potential parents via any variable selection approach
2. Maximize likelihood under the acyclicity constraint via an augmented Lagrangian:
   max_φ  E_{X∼P_X} Σ_{i=1}^d log p_φ(x_i | x_{−i}) − α_t (Tr e^{A_φ} − d) − (μ_t / 2)(Tr e^{A_φ} − d)²
3. DAG pruning (analogous to CAM): for each node, get rid of some parents via any variable selection approach

Steps 1 and 3 help reduce overfitting. Important, since adding edges cannot reduce the maximum likelihood.

SLIDE 46

Gradient-Based Neural DAG Learning

(Figure: correct edges vs. wrong edges)

SLIDE 47

Experiments

Synthetic data: X_i | X_{π_i^G} ∼ N(f_i(X_{π_i^G}), σ_i²), with f_i ∼ Gaussian Process

Models: GraN-DAG, NOTEARS and CAM make the Gaussian assumption

Real data: measurements of expression levels of proteins and phospholipids in human immune system cells [Sachs et al., 2005]

                      Synthetic (50 nodes)         Protein data set
                      SHD           SID            SHD    SID
Continuous  GraN-DAG  102.6±21.2    1060.1±109.4   13     47
            DAG-GNN   191.9±15.2    2146.2±64      16     44
            NOTEARS   202.3±14.3    2149.1±76.3    21     44
Discrete    CAM       98.8±20.7     1197.2±125.9   12     55
            RANDOM    708.4±234.4   1921.3±203.5   21     60

DAG-GNN [Yu et al., 2019]

SLIDE 48

Experiments

In the previous setup, the synthetic data generation and the model matched. Here: model misspecification.

GSF [Huang et al., 2018a]

SLIDE 49

Experiments: Effect of sample size

- Previous experiment used a relatively small dataset: 1000 examples
- GraN-DAG is more expressive than CAM
- This advantage shows up in large sample size regimes

SLIDE 50

GraN-DAG with interventions [Brouillard et al., 2020]

Can we make use of interventional data?

SLIDE 52

GraN-DAG with interventions [Brouillard et al., 2020]

Some terminology and setting:

- I ⊂ {1, ..., d} is an interventional target (the set of nodes on which we intervene)
- Definition of stochastic intervention:

  p(x_1, ..., x_d | do(X_I)) ≜ ∏_{j ∉ I} p_j(x_j | x_{π_j^G}) ∏_{j ∈ I} p̃_j(x_j)

  where p̃_j(x_j) is the new marginal replacing p_j(x_j | x_{π_j^G}) (parents are "cut out")
- Observed: {(X^(1), I^(1)), ..., (X^(n), I^(n))}, where I^(i) is the interventional target associated with observation X^(i):

  I^(i) ∼ P(I) i.i.d. ∀i
  X^(i) | I^(i) ∼ P(X | I = I^(i)) ≜ p(x_1, ..., x_d | do(X_{I^(i)})) ∀i    (1)

  where P(I) is a distribution over a collection I of interventional targets
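This data-generating setting can be simulated directly. A minimal sketch (the two-node chain, its coefficients, and the standard-normal p̃ are assumptions): each sample carries an interventional target, and intervened nodes are drawn from the new marginal instead of their conditional.

```python
import numpy as np

# Two-node chain X0 -> X1; intervening on node 1 cuts the edge.
rng = np.random.default_rng(0)

def sample_one(target):
    x0 = rng.standard_normal()
    if 1 in target:
        x1 = rng.standard_normal()                    # p~_1: new marginal
    else:
        x1 = 2.0 * x0 + 0.1 * rng.standard_normal()   # p_1(x1 | x0)
    return np.array([x0, x1]), target

targets = [frozenset(), frozenset({1})]   # observational, or do on node 1
data = [sample_one(targets[rng.integers(2)]) for _ in range(20_000)]

obs = np.array([x for x, t in data if not t])
intv = np.array([x for x, t in data if 1 in t])
print(np.corrcoef(obs[:, 0], obs[:, 1])[0, 1] > 0.9)        # True: dependent
print(abs(np.corrcoef(intv[:, 0], intv[:, 1])[0, 1]) < 0.05)  # True: edge cut
```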

SLIDE 53

GraN-DAG with interventions [Brouillard et al., 2020]

Think about a CGM as a family of models of the form

{ ∏_{j ∉ I} p_j(x_j | x_{π_j^G}; φ_j) ∏_{j ∈ I} p̃_j(x_j; ω_j^I)  |  I ∈ I }

where ω^I ≜ {ω_j^I}_{j ∈ I}, for each I ∈ I, are learnable parameters.

The natural optimization problem:

max_{φ, {ω^I}_{I ∈ I}}  E_{(X,I)∼P(X,I)} [ Σ_{j ∉ I} log p_j(X_j | X_{−j}; φ_j) + Σ_{j ∈ I} log p̃_j(X_j; ω_j^I) ]  s.t. Tr e^{A_φ} = d

But we do not really care about learning the p̃_j(X_j; ω_j^I) ...

... and the problem trivially decomposes as a sum of a max_φ term and a max_{ω^I} term, so ...

SLIDE 54

GraN-DAG with interventions [Brouillard et al., 2020]

... we can forget about the p̃_j(X_j; ω_j^I) altogether and get

The optimization problem:

max_φ  E_{(X,I)∼P(X,I)} Σ_{j ∉ I} log p(X_j | X_{−j}; φ_j)  s.t. Tr e^{A_φ} = d

In a nutshell: we throw out the conditionals associated with the intervened variables.
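The masked score above amounts to zeroing out the intervened nodes' terms before summing. A minimal sketch with a hypothetical per-node log-likelihood matrix (the numbers are made up):

```python
import numpy as np

def interventional_score(node_loglik, targets):
    """node_loglik[s, j] = log p(x_j | x_-j; phi_j) for sample s.
    targets[s] = set of intervened nodes for sample s."""
    n, d = node_loglik.shape
    mask = np.ones((n, d))
    for s, I in enumerate(targets):
        for j in I:
            mask[s, j] = 0.0       # throw out intervened conditionals
    return (mask * node_loglik).sum() / n

ll = np.array([[-1.0, -2.0],       # sample 0: observational
               [-1.5, -0.5]])      # sample 1: node 1 intervened
print(interventional_score(ll, [set(), {1}]))  # ((-1-2) + (-1.5)) / 2 = -2.25
```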

SLIDE 55

GraN-DAG with interventions [Brouillard et al., 2020]

- Linear data (unidentifiable without interventions)
- 50 nodes and ≈ 200 edges
- Intervention on one node at a time

SLIDE 56

GraN-DAG with interventions [Brouillard et al., 2020]

(Figures: results on nonlinear data and on linear data.)

More experiments in the workshop paper...

SLIDE 57

GraN-DAG with Neural Autoregressive flows

In previous experiments, GraN-DAG's model was:

X_i = NN_φ(i)(X_{π_i^G}) + σ_i Z,  with Z ∼ N(0, 1) ∀i

GraN-DAG's framework allows for the use of Neural Autoregressive Flows [Huang et al., 2018b]:

X_i = NAF(Z; NN_φ(i)(X_{π_i^G})),  with Z ∼ N(0, 1) ∀i

The function NAF(· ; NN_φ(i)(X_{π_i^G})) is invertible and has a tractable Jacobian, so the likelihood of X can be computed exactly and maximized.
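The exact-likelihood claim is just the change-of-variables formula. A minimal sketch with a simple affine map g(z) = a z + b standing in for the learned monotone NAF transform (the transform and its parameters are assumptions; a real NAF is a neural monotone network):

```python
import numpy as np

# Change of variables: log p_X(x) = log p_Z(g^{-1}(x)) - log |g'(g^{-1}(x))|.
a, b = 2.0, 1.0                       # stand-in "flow" parameters

def log_px(x):
    z = (x - b) / a                   # invert the flow
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # standard-normal base density
    return log_pz - np.log(abs(a))    # minus log-Jacobian of g

# Sanity check: X = a Z + b with Z ~ N(0,1) means X ~ N(b, a^2)
x = 0.7
direct = -0.5 * (((x - b) / a) ** 2 + np.log(2 * np.pi * a**2))
print(np.isclose(log_px(x), direct))  # True
```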

SLIDE 58

GraN-DAG with Neural Autoregressive flows

Without interventions, we run into identifiability problems ...

Future work: make it work with interventional data (since identifiability is less of a problem there)

SLIDE 59

Conclusion and future work

Gradient-based DAG search...
- ... performs similarly to its discrete analogs
- ... scales well with the number of samples (since it is amenable to stochastic optimization)
- ... can easily be adapted to work with interventional data
- ... allows for very expressive density models (Neural Autoregressive Flows)

Future work:
- DAGs appear in many places; could we adapt the neural acyclicity constraint to other problems? (Not causality?)
- Drawing links between causality and representation learning

SLIDE 60

References

Brouillard, P., Drouin, A., Lachapelle, S., Lacoste, A., & Lacoste-Julien, S. (2020). Gradient-based neural DAG learning with interventions.

Bühlmann, P., Peters, J., & Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics.

Chickering, D. (2003). Optimal structure identification with greedy search. Journal of Machine Learning Research.

Huang, B., Zhang, K., Lin, Y., Schölkopf, B., & Glymour, C. (2018a). Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Huang, C.-W., Krueger, D., Lacoste, A., & Courville, A. (2018b). Neural autoregressive flows.

Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Lachapelle, S., Brouillard, P., Deleu, T., & Lacoste-Julien, S. (2020). Gradient-based neural DAG learning. In Proceedings of the Eighth International Conference on Learning Representations.

Peters, J., Mooij, J. M., Janzing, D., & Schölkopf, B. (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research.

Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D., & Nolan, G. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science.

Yu, Y., Chen, J., Gao, T., & Yu, M. (2019). DAG-GNN: DAG structure learning with graph neural networks. In Proceedings of the 36th International Conference on Machine Learning.

Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31.
