

SLIDE 1

Gradient-Based Neural DAG Learning for Causal Discovery

Sébastien Lachapelle1, Philippe Brouillard1, Tristan Deleu1, Simon Lacoste-Julien1,2

1Mila, Université de Montréal 2Canada CIFAR AI Chair

September 6th, 2019


SLIDE 2

Causal graphical model (CGM)

Random vector $X \in \mathbb{R}^d$ ($d$ variables). Let $G$ be a directed acyclic graph (DAG). Assume

$p(x) = \prod_{i=1}^{d} p(x_i \mid x_{\pi_i^G})$

where $\pi_i^G$ is the set of parents of node $i$ in $G$. The factorization encodes statistical independences. A CGM is almost identical to a Bayesian network... except the arrows are given a causal meaning.

Example with $d = 3$:

$p(x) = p(x_1 \mid x_2)\, p(x_2)\, p(x_3 \mid x_1, x_2)$

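To make the factorization concrete, here is a minimal Python sketch (my own illustration, not from the talk) that evaluates the d = 3 joint density above, assuming made-up linear-Gaussian conditionals:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical linear-Gaussian conditionals for the d = 3 example:
    # p(x) = p(x1 | x2) p(x2) p(x3 | x1, x2)
    def joint_density(x1, x2, x3):
        p_x2 = norm.pdf(x2)                                     # x2 has no parents
        p_x1_given_x2 = norm.pdf(x1, loc=0.8 * x2)              # parent: x2
        p_x3_given_x12 = norm.pdf(x3, loc=0.5 * x1 - 0.3 * x2)  # parents: x1, x2
        return p_x1_given_x2 * p_x2 * p_x3_given_x12

    print(joint_density(0.1, -0.4, 1.2))

The coefficients 0.8, 0.5, -0.3 are arbitrary; the point is only that the joint density is a product of one conditional per node.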

SLIDE 3

Structure Learning

                X1      X2      X3
    sample 1    1.76    10.46   0.002
    sample 2    3.42    78.6    0.011
    ...         ...     ...     ...
    sample n    4.56    9.35    1.96


SLIDE 4

Structure Learning

Score-based algorithms:

$\hat{G} = \arg\max_{G \in \mathrm{DAG}} \mathrm{Score}(G)$

Often, $\mathrm{Score}(G)$ = regularized maximum likelihood under $G$.

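To illustrate what such a score might look like (a toy sketch under a linear-Gaussian assumption, not the paper's scoring function), here is a BIC-style regularized log-likelihood for a candidate graph given as parent sets:

    import numpy as np

    def bic_score(X, parents):
        # BIC-style score: Gaussian log-likelihood of each variable given its
        # parents (fit by least squares), minus a penalty per edge.
        n, d = X.shape
        score = 0.0
        for i in range(d):
            pa = parents[i]
            if pa:
                coef, *_ = np.linalg.lstsq(X[:, pa], X[:, i], rcond=None)
                resid = X[:, i] - X[:, pa] @ coef
            else:
                resid = X[:, i] - X[:, i].mean()
            score += -0.5 * n * np.log(resid.var() + 1e-12) - 0.5 * np.log(n) * len(pa)
        return score

    # Example: score the d = 3 graph from earlier (X2 -> X1, {X1, X2} -> X3).
    X = np.random.randn(100, 3)
    print(bic_score(X, parents=[[1], [], [0, 1]]))

The hard part is the arg max itself: the number of DAGs grows super-exponentially with d, which is what motivates the continuous formulations that follow.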

SLIDE 5

Structure Learning

Taxonomy of score-based algorithms (non-exhaustive):

                Discrete optim.                Continuous optim.
    Linear      GES [Chickering, 2003]         NOTEARS [Zheng et al., 2018]
    Nonlinear   CAM [Bühlmann et al., 2014]    GraN-DAG [our contribution]


SLIDE 6

NOTEARS: Continuous optimization for structure learning

Encode the graph as a weighted adjacency matrix $U = [u_1 \mid \cdots \mid u_d] \in \mathbb{R}^{d \times d}$.

[Figure: a 3-node example graph with its binary adjacency matrix $A$ (ones marking the edges) and the corresponding weighted adjacency matrix $U$ (weights 4.8, 0.2, and -1.7 on those edges).]


SLIDE 7

NOTEARS: Continuous optimization for structure learning

$U$ represents the coefficients of a linear model:

$X_i := u_i^\top X + \mathrm{noise}_i \quad \forall i$


SLIDE 8

NOTEARS: Continuous optimization for structure learning

For an arbitrary $U$, the associated graph might be cyclic, so we need an acyclicity constraint. NOTEARS [Zheng et al., 2018] uses this differentiable one:

$\operatorname{Tr} e^{U \odot U} - d = 0, \qquad e^M \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}$
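The constraint is easy to sketch in Python (my own illustration; SciPy's expm computes the matrix exponential):

    import numpy as np
    from scipy.linalg import expm

    def notears_constraint(U):
        # h(U) = Tr e^(U o U) - d, which is zero iff the graph of U is acyclic
        d = U.shape[0]
        return np.trace(expm(U * U)) - d

    U_cyclic = np.array([[0.0, 1.0],
                         [1.0, 0.0]])    # 2-cycle: X1 -> X2 -> X1
    U_dag = np.array([[0.0, 1.0],
                      [0.0, 0.0]])       # single edge: X1 -> X2
    print(notears_constraint(U_cyclic))  # ~1.086: constraint violated
    print(notears_constraint(U_dag))     # 0.0: acyclic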

SLIDE 9

NOTEARS: Continuous optimization for structure learning

NOTEARS [Zheng et al., 2018] solves this continuous constrained optimization problem:

$\max_U \; \underbrace{-\|X - XU\|_F^2 - \lambda \|U\|_1}_{\text{Score}} \quad \text{s.t.} \quad \operatorname{Tr} e^{U \odot U} - d = 0$

where $X \in \mathbb{R}^{n \times d}$ is the design matrix containing all $n$ samples.


SLIDE 10

NOTEARS: Continuous optimization for structure learning

Solve approximately using an augmented Lagrangian method. This amounts to maximizing (with gradient ascent)

$-\|X - XU\|_F^2 - \lambda \|U\|_1 - \alpha_t \left(\operatorname{Tr} e^{U \odot U} - d\right) - \frac{\mu_t}{2} \left(\operatorname{Tr} e^{U \odot U} - d\right)^2$

while gradually increasing $\alpha_t$ and $\mu_t$.

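A bare-bones sketch of the inner gradient-ascent loop for fixed (alpha_t, mu_t) (my own simplification: a plain sign subgradient for the l1 term, a fixed step size, and no outer-loop schedule):

    import numpy as np
    from scipy.linalg import expm

    def inner_step(X, U, lam, alpha, mu, lr=1e-3, iters=200):
        # Gradient ascent on the penalized NOTEARS objective for fixed (alpha, mu).
        n, d = X.shape
        for _ in range(iters):
            E = expm(U * U)                        # e^(U o U)
            h = np.trace(E) - d                    # constraint violation
            grad_h = E.T * (2.0 * U)               # gradient of Tr e^(U o U)
            grad = (2.0 * X.T @ (X - X @ U)        # gradient of -||X - XU||_F^2
                    - lam * np.sign(U)             # subgradient of -lam * ||U||_1
                    - (alpha + mu * h) * grad_h)   # gradients of the penalty terms
            U = U + lr * grad
        return U

    # Outer loop (not shown): call inner_step repeatedly while increasing alpha_t
    # and mu_t until the violation h is ~0, then threshold small entries of U.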

SLIDE 11

NOTEARS: The acyclicity constraint

$\operatorname{Tr} e^{U \odot U} - d = 0, \qquad e^M \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}$

Suppose $A \in \{0, 1\}^{d \times d}$ is the adjacency matrix of some directed graph.


SLIDE 12

NOTEARS: The acyclicity constraint

$(A^k)_{ii}$ = number of length-$k$ cycles (closed walks) through node $i$.


SLIDE 13

NOTEARS: The acyclicity constraint

Graph acyclic $\iff (A^k)_{ii} = 0$ for all $i$ and all $k \geq 1$.


SLIDE 14

NOTEARS: The acyclicity constraint

$\iff \operatorname{Tr} \sum_{k=1}^{\infty} \frac{A^k}{k!} = 0$

(every term is nonnegative, so the trace of the sum vanishes iff every term does)


SLIDE 15

NOTEARS: The acyclicity constraint

$\iff \operatorname{Tr} \left( \sum_{k=0}^{\infty} \frac{A^k}{k!} - A^0 \right) = 0$


SLIDE 16

NOTEARS: The acyclicity constraint

$\iff \operatorname{Tr} e^{A} - d = 0$

(since $A^0 = I$ and $\operatorname{Tr} I = d$)


SLIDE 17

NOTEARS: The acyclicity constraint

The argument is almost identical when using the weighted adjacency matrix $U$ instead of $A$: the entries of $U \odot U$ are nonnegative, which is all the cycle-counting argument needs...

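A quick numerical check of the cycle-counting fact (my own example: a single directed 3-cycle):

    import numpy as np
    from scipy.linalg import expm

    # Directed 3-cycle: X1 -> X2 -> X3 -> X1
    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])

    for k in range(1, 5):
        diag = np.diag(np.linalg.matrix_power(A, k))
        print(k, diag)  # nonzero only at k = 3: the lone 3-cycle through each node

    print(np.trace(expm(A)) - 3)  # > 0, so the constraint flags the cycle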

SLIDE 18

Gradient-Based Neural DAG Learning: can we go nonlinear?


SLIDE 19

Gradient-Based Neural DAG Learning

Each conditional $p(x_i \mid x_{-i}; \theta_{(i)})$ is parameterized by its own neural network:

$\phi_{(i)} \triangleq \{W^{(1)}_{(i)}, \ldots, W^{(L+1)}_{(i)}\}$, where $W^{(\ell)}_{(i)}$ is the $\ell$th weight matrix of $\mathrm{NN}_{\phi_{(i)}}$, and $\phi \triangleq \{\phi_{(i)}\}_{i=1}^{d}$.

SLIDE 20

Gradient-Based Neural DAG Learning

But the product $\prod_{i=1}^{d} p(x_i \mid x_{-i}; \theta_{(i)})$ does not decompose according to a DAG!

We need to constrain the networks to be acyclic! How?

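To fix ideas, a minimal PyTorch sketch of one network per conditional (my own illustration, not the paper's architecture; here each NN_phi_(i) outputs a mean and log-variance theta_(i) for a Gaussian conditional, and masks out its own input):

    import torch
    import torch.nn as nn

    class NodeMLP(nn.Module):
        # NN_phi_(i): maps x (with x_i masked out) to theta_(i) = (mean, log-variance).
        def __init__(self, d, i, hidden=16):
            super().__init__()
            mask = torch.ones(d)
            mask[i] = 0.0                      # x_i must not predict itself
            self.register_buffer("mask", mask)
            self.net = nn.Sequential(
                nn.Linear(d, hidden), nn.LeakyReLU(),
                nn.Linear(hidden, 2),
            )

        def forward(self, x):
            return self.net(x * self.mask)

    d = 3
    nets = [NodeMLP(d, i) for i in range(d)]   # phi = {phi_(1), ..., phi_(d)}

On its own, nothing stops NN_phi_(1) from using X2 while NN_phi_(2) uses X1, which is exactly the cyclic situation the next slides rule out.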

SLIDE 21

Gradient-Based Neural DAG Learning

Key idea: construct a weighted adjacency matrix $A_\phi$ (analogous to $U$ from the linear case) that can be used in the same acyclicity constraint. Then maximize the likelihood under the acyclicity constraint via an augmented Lagrangian:

$\max_\phi \; \sum_{i=1}^{d} \log p_\phi(x_i \mid x_{-i}) - \alpha_t \left(\operatorname{Tr} e^{A_\phi} - d\right) - \frac{\mu_t}{2} \left(\operatorname{Tr} e^{A_\phi} - d\right)^2$


SLIDE 22

Constructing weighted adjacency matrix Aφ

Let's measure the "strength" of the edge $X_j \to X_i$.


SLIDE 23

Constructing weighted adjacency matrix Aφ

Path product: $|W^{(1)}_{h_1 j}|\,|W^{(2)}_{h_2 h_1}|\,|W^{(3)}_{k h_2}| \geq 0$


SLIDE 24

Constructing weighted adjacency matrix Aφ

$C \triangleq |W^{(3)}|\,|W^{(2)}|\,|W^{(1)}|$ (entrywise absolute values)

"Connection strength" from $X_j$ to $\theta_{(i)}$: $\sum_{k=1}^{m} C_{kj} \geq 0$


SLIDE 25

Constructing weighted adjacency matrix Aφ

$\sum_{k=1}^{m} C_{kj} = 0 \;\Rightarrow\;$ all paths from $X_j$ to $X_i$ are inactive!


SLIDE 26

Constructing weighted adjacency matrix Aφ

Collecting these connection strengths (and zeroing the diagonal) gives

$(A_\phi)_{ji} \triangleq \begin{cases} \sum_{k=1}^{m} C^{(i)}_{kj}, & \text{if } i \neq j \\ 0, & \text{otherwise} \end{cases}$

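Continuing the PyTorch sketch from slide 20 (reusing its imports, NodeMLP, nets, and d, all of which are my own hypothetical names), one possible way to assemble A_phi:

    def weighted_adjacency(nets, d):
        # (A_phi)[j, i] = sum_k C^(i)_kj, where C^(i) is the product of the
        # entrywise absolute values of NN_phi_(i)'s weight matrices.
        cols = []
        for net in nets:
            linears = [m for m in net.net if isinstance(m, nn.Linear)]
            C = linears[0].weight.abs() * net.mask   # mask zeroes X_i's own column
            for lin in linears[1:]:
                C = lin.weight.abs() @ C
            cols.append(C.sum(dim=0))                # strength of each X_j -> X_i
        return torch.stack(cols, dim=1)

    A_phi = weighted_adjacency(nets, d)
    # A_phi now plays the role U played in NOTEARS: plug it into Tr e^(A_phi) - d.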

SLIDE 27

Gradient-Based Neural DAG Learning

[Figure: entries of $A_\phi$ for correct edges vs. wrong edges.]


SLIDE 28

Experiments

Synthetic data: $X_i \mid X_{\pi_i^G} \sim \mathcal{N}\!\left(f_i(X_{\pi_i^G}), \sigma_i^2\right)$, with $f_i$ sampled from a Gaussian process.

Real data: measurements of expression levels of proteins and phospholipids in human immune system cells [Sachs et al., 2005].

Results (SHD and SID; lower is better):

                              Synthetic (50 nodes)          Protein data set
                              SHD            SID            SHD    SID
    Continuous  GraN-DAG      102.6±21.2     1060.1±109.4   13     47
                DAG-GNN       191.9±15.2     2146.2±64      16     44
                NOTEARS       202.3±14.3     2149.1±76.3    21     44
    Discrete    CAM           98.8±20.7      1197.2±125.9   12     55
                RANDOM        708.4±234.4    1921.3±203.5   21     60

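For reference, a sketch of the SHD metric under one common convention (a reversed edge counts as a single error); this is my own implementation, not the paper's evaluation code:

    import numpy as np

    def shd(A_true, A_pred):
        # Structural Hamming distance between binary adjacency matrices:
        # the number of edge additions, deletions, and reversals separating them.
        diff = np.abs(A_true - A_pred)
        # A reversed edge is wrong in both directions; count the pair once.
        reversed_pairs = int(((diff + diff.T) == 2).sum()) // 2
        return int(diff.sum()) - reversed_pairs

    A_true = np.array([[0, 1], [0, 0]])
    A_pred = np.array([[0, 0], [1, 0]])  # same edge, flipped
    print(shd(A_true, A_pred))           # 1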

SLIDE 29

Conclusion and future work

Contributions:
- We proposed a new characterization of acyclicity for neural networks.
- GraN-DAG is the first nonlinear continuous approach shown to be competitive with state-of-the-art nonlinear discrete approaches.

Future work:
- Working with interventional data.
- DAGs appear in many places; could we adapt the neural acyclicity constraint to other problems (beyond causality)?


SLIDE 30

References

Bühlmann, P., Peters, J., & Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics.

Chickering, D. (2003). Optimal structure identification with greedy search. Journal of Machine Learning Research.

Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D., & Nolan, G. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science.

Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31.
