Gradient-Based Neural DAG Learning

Gradient-Based Neural DAG Learning for Causal Discovery, Sébastien Lachapelle - PowerPoint PPT Presentation



1. Gradient-Based Neural DAG Learning for Causal Discovery
Sébastien Lachapelle¹, Philippe Brouillard¹, Tristan Deleu¹, Simon Lacoste-Julien¹,²
¹Mila, Université de Montréal   ²Canada CIFAR AI Chair
MAIS 2019, September 6th, 2019
Outline: Background · GraN-DAG · Experiments · Conclusion

2. Causal graphical model (CGM)
Random vector $X \in \mathbb{R}^d$ ($d$ variables); running example with $d = 3$.
Let $G$ be a directed acyclic graph (DAG).
Assume $p(x) = \prod_{i=1}^{d} p(x_i \mid x_{\pi_i^G})$, where $\pi_i^G$ denotes the parents of $i$ in $G$.
This encodes statistical independences.
A CGM is almost identical to a Bayesian network, e.g. $p(x) = p(x_1 \mid x_2)\, p(x_2)\, p(x_3 \mid x_1, x_2)$ for $d = 3$...
...except that the arrows are given a causal meaning.

3. Structure Learning
Observed data ($n$ samples of the $d = 3$ variables):

              X1       X2       X3
  sample 1    1.76     10.46    0.002
  sample 2    3.42     78.6     0.011
  ...         ...      ...      ...
  sample n    4.56     9.35     1.96

4. Structure Learning
(Same data table as on the previous slide.)
Score-based algorithms: $\hat{G} = \arg\max_{G \in \text{DAGs}} \text{Score}(G)$.
Often, $\text{Score}(G)$ is a regularized maximum likelihood under $G$.
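To make "score-based" concrete, here is a minimal brute-force sketch in NumPy: it enumerates every DAG over a handful of variables and keeps the one with the best penalized Gaussian least-squares fit. The particular score (log residual variance plus an edge-count penalty) and all function names are illustrative assumptions, not the scores used by the methods on the next slides.

```python
import itertools
import numpy as np

def is_dag(A):
    """In a d-node graph, A is acyclic iff A^d = 0 (no walks of length d exist)."""
    return not np.linalg.matrix_power(A, A.shape[0]).any()

def score(A, X, lam=1.0):
    """Penalized Gaussian log-likelihood: least-squares fit of each node on its parents."""
    n, d = X.shape
    total = 0.0
    for i in range(d):
        parents = np.flatnonzero(A[:, i])          # column i lists the parents of node i
        if parents.size:
            coef, *_ = np.linalg.lstsq(X[:, parents], X[:, i], rcond=None)
            resid = X[:, i] - X[:, parents] @ coef
        else:
            resid = X[:, i]
        total += -n / 2 * np.log(np.mean(resid ** 2))   # log-likelihood up to constants
    return total - lam * A.sum()                         # penalize the number of edges

def exhaustive_search(X):
    """Brute-force argmax over all DAGs on d nodes (only feasible for tiny d)."""
    d = X.shape[1]
    off_diag = [(i, j) for i in range(d) for j in range(d) if i != j]
    best, best_A = -np.inf, None
    for edges in itertools.product([0, 1], repeat=len(off_diag)):
        A = np.zeros((d, d))
        for (i, j), e in zip(off_diag, edges):
            A[i, j] = e
        if is_dag(A) and (s := score(A, X)) > best:
            best, best_A = s, A
    return best_A
```

This exhaustive search is exponential in the number of possible edges, which is exactly why the discrete and continuous optimization strategies on the next slide exist.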

5. Structure Learning
Taxonomy of score-based algorithms (non-exhaustive):

                Discrete optim.                  Continuous optim.
  Linear        GES [Chickering, 2003]           NOTEARS [Zheng et al., 2018]
  Nonlinear     CAM [Bühlmann et al., 2014]      GraN-DAG [our contribution]

6. NOTEARS: Continuous optimization for structure learning
Encode the graph as a weighted adjacency matrix $U = [u_1 \mid \dots \mid u_d] \in \mathbb{R}^{d \times d}$, e.g.
$$A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \;\text{(adjacency matrix)}, \qquad U = \begin{pmatrix} 0 & 0 & 4.8 \\ -1.7 & 0 & 0.2 \\ 0 & 0 & 0 \end{pmatrix} \;\text{(weighted adjacency matrix)}.$$

7. NOTEARS: Continuous optimization for structure learning
(Same adjacency matrices $A$ and $U$ as on the previous slide.)
$U$ represents the coefficients of a linear model: $X_i := u_i^\top X + \text{noise}_i$ for all $i$.
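As an illustration of this linear model, a short NumPy sketch that simulates samples from the weighted adjacency matrix $U$ reconstructed above; the Gaussian noise and sample size are arbitrary choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 1000

# Weighted adjacency matrix U = [u_1 | ... | u_d]: column i holds the
# coefficients used in the structural equation X_i := u_i^T X + noise_i.
U = np.array([[ 0.0, 0.0, 4.8],
              [-1.7, 0.0, 0.2],
              [ 0.0, 0.0, 0.0]])

# Because the graph is acyclic, the structural equations have a unique solution;
# with samples as rows of the design matrix this reads X = noise (I - U)^{-1}.
noise = rng.normal(size=(n, d))
X = noise @ np.linalg.inv(np.eye(d) - U)

# Sanity check: the residual of the linear model is exactly the noise.
assert np.allclose(X - X @ U, noise)
```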

8. NOTEARS: Continuous optimization for structure learning
$U$ represents the coefficients of a linear model: $X_i := u_i^\top X + \text{noise}_i$ for all $i$.
For an arbitrary $U$, the associated graph might be cyclic, so an acyclicity constraint is needed.
NOTEARS [Zheng et al., 2018] uses this differentiable acyclicity constraint:
$$\operatorname{Tr} e^{U \odot U} - d = 0, \qquad \text{where } e^{M} \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}.$$
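A direct way to evaluate this constraint and its gradient with SciPy's matrix exponential; the gradient formula $\nabla_U \operatorname{Tr} e^{U \odot U} = (e^{U \odot U})^\top \odot 2U$ is the one given by Zheng et al. [2018], and the function names here are just illustrative.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(U):
    """h(U) = Tr(e^{U * U}) - d; equals 0 iff the graph induced by U is acyclic."""
    d = U.shape[0]
    return np.trace(expm(U * U)) - d

def acyclicity_grad(U):
    """Gradient of h with respect to U: (e^{U * U})^T * 2U."""
    return expm(U * U).T * (2 * U)
```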

9. NOTEARS: Continuous optimization for structure learning
NOTEARS [Zheng et al., 2018] solves this continuous constrained optimization problem:
$$\max_{U} \; \underbrace{-\lVert X - XU \rVert_F^2 - \lambda \lVert U \rVert_1}_{\text{Score}} \quad \text{s.t.} \quad \operatorname{Tr} e^{U \odot U} - d = 0,$$
where $X \in \mathbb{R}^{n \times d}$ is the design matrix containing all $n$ samples.

10. NOTEARS: Continuous optimization for structure learning
The constrained problem above is solved approximately using an augmented Lagrangian method.
This amounts to maximizing (with gradient ascent)
$$-\lVert X - XU \rVert_F^2 - \lambda \lVert U \rVert_1 - \alpha_t \left( \operatorname{Tr} e^{U \odot U} - d \right) - \frac{\mu_t}{2} \left( \operatorname{Tr} e^{U \odot U} - d \right)^2$$
while gradually increasing $\alpha_t$ and $\mu_t$.
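A rough NumPy sketch of this augmented Lagrangian scheme. The plain gradient-ascent inner loop, the learning rate, the 1/n scaling of the score, and the schedule for $\alpha_t$ and $\mu_t$ are illustrative assumptions; the actual NOTEARS implementation uses a quasi-Newton inner solver and a more careful penalty update.

```python
import numpy as np
from scipy.linalg import expm

def notears_sketch(X, lam=0.1, lr=1e-3, n_outer=10, n_inner=300):
    """Illustrative gradient-ascent version of the NOTEARS augmented Lagrangian."""
    n, d = X.shape
    U = np.zeros((d, d))
    alpha, mu = 0.0, 1.0
    for _ in range(n_outer):
        for _ in range(n_inner):
            E = expm(U * U)
            h = np.trace(E) - d                        # acyclicity violation Tr e^{U*U} - d
            grad_h = E.T * (2 * U)                     # gradient of h w.r.t. U
            grad_score = 2 * X.T @ (X - X @ U) / n     # gradient of -||X - XU||_F^2 (scaled by 1/n)
            grad_l1 = lam * np.sign(U)                 # subgradient of lam * ||U||_1
            # Ascend the augmented Lagrangian:
            #   score - lam*||U||_1 - alpha*h - (mu/2)*h^2
            U += lr * (grad_score - grad_l1 - (alpha + mu * h) * grad_h)
        h = np.trace(expm(U * U)) - d
        alpha += mu * h                                # dual update
        mu *= 10                                       # gradually tighten the penalty
    return U
```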

11. NOTEARS: The acyclicity constraint
$$\operatorname{Tr} e^{U \odot U} - d = 0, \qquad e^{M} \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}$$
Suppose $A \in \{0, 1\}^{d \times d}$ is the adjacency matrix of some directed graph.

12. NOTEARS: The acyclicity constraint (continued)
$(A^k)_{ii}$ = number of cycles of length $k$ passing through node $i$.

13. NOTEARS: The acyclicity constraint (continued)
Since $(A^k)_{ii}$ counts the cycles of length $k$ through node $i$:
Graph acyclic $\iff (A^k)_{ii} = 0$ for all $i$ and all $k \geq 1$.

14. NOTEARS: The acyclicity constraint (continued)
Graph acyclic $\iff (A^k)_{ii} = 0$ for all $i$ and all $k \geq 1$
$\iff \operatorname{Tr}\left( \sum_{k=1}^{\infty} \frac{A^k}{k!} \right) = 0$

15. NOTEARS: The acyclicity constraint (continued)
$\iff \operatorname{Tr}\left( \sum_{k=0}^{\infty} \frac{A^k}{k!} - A^0 \right) = 0$

16. NOTEARS: The acyclicity constraint (continued)
$\iff \operatorname{Tr} e^{A} - d = 0$ (since $\operatorname{Tr} A^0 = \operatorname{Tr} I = d$)

17. NOTEARS: The acyclicity constraint (continued)
The argument is almost identical when using the weighted adjacency matrix $U$ instead of $A$...
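A quick numerical check of this equivalence, using the binary adjacency matrix $A$ from the earlier example: for the acyclic graph the quantity $\operatorname{Tr} e^{A} - d$ is zero, and adding one edge that closes a cycle makes it strictly positive. The specific extra edge is just an illustrative choice.

```python
import numpy as np
from scipy.linalg import expm

def h(A):
    return np.trace(expm(A)) - A.shape[0]

# Acyclic example from the earlier slide: edges 2 -> 1, 1 -> 3, 2 -> 3.
A_dag = np.array([[0, 0, 1],
                  [1, 0, 1],
                  [0, 0, 0]], dtype=float)

A_cyc = A_dag.copy()
A_cyc[2, 1] = 1.0            # add 3 -> 2, which creates the cycle 2 -> 3 -> 2

print(h(A_dag))              # ~0.0: a DAG has no cycles of any length
print(h(A_cyc))              # > 0: the cycle contributes to Tr(A^k) for some k
```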

18. Gradient-Based Neural DAG Learning
Can we go nonlinear?

19. Gradient-Based Neural DAG Learning
One neural network per variable $i$, with parameters
$\phi^{(i)} \triangleq \{ W^{(1)}_{(i)}, \dots, W^{(L+1)}_{(i)} \}$, where $W^{(\ell)}_{(i)}$ is the $\ell$-th weight matrix of NN $\phi^{(i)}$, and $\phi \triangleq \{ \phi^{(i)} \}_{i=1}^{d}$.

20. Gradient-Based Neural DAG Learning
$\phi^{(i)} \triangleq \{ W^{(1)}_{(i)}, \dots, W^{(L+1)}_{(i)} \}$; $W^{(\ell)}_{(i)}$ is the $\ell$-th weight matrix of NN $\phi^{(i)}$; $\phi \triangleq \{ \phi^{(i)} \}_{i=1}^{d}$.
But $\prod_{i=1}^{d} p(x_i \mid x_{-i};\, \theta^{(i)})$ does not decompose according to a DAG!
We need to constrain the networks to be acyclic! How?

21. Gradient-Based Neural DAG Learning
Key idea: construct a weighted adjacency matrix $A_\phi$ (analogous to $U$ from the linear case) which can be used in the acyclicity constraint.
Then maximize the likelihood under the acyclicity constraint via an augmented Lagrangian:
$$\max_{\phi} \; \underbrace{\sum_{i=1}^{d} \log p_\phi(x_i \mid x_{-i}) - \alpha_t \left( \operatorname{Tr} e^{A_\phi} - d \right) - \frac{\mu_t}{2} \left( \operatorname{Tr} e^{A_\phi} - d \right)^2}_{\text{Augmented Lagrangian}}$$
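The slide does not spell out how $A_\phi$ is constructed. The sketch below follows the path-product idea from the GraN-DAG paper: the influence of input $x_i$ on the output of network $j$ is summarized by products of absolute weight values along paths through the network, so $(A_\phi)_{ij} = 0$ exactly when $x_i$ cannot affect network $j$'s output. The two-layer architecture, shapes, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.linalg import expm

def network_path_weights(weights):
    """Path-product matrix |W^(L+1)| ... |W^(1)| for one network's list of weight matrices."""
    C = np.abs(weights[0])
    for W in weights[1:]:
        C = np.abs(W) @ C
    return C                                     # shape: (n_outputs, d)

def adjacency_phi(phi):
    """(A_phi)[i, j]: aggregated path weight from input x_i into network j."""
    d = len(phi)
    A = np.zeros((d, d))
    for j, weights in enumerate(phi):
        C = network_path_weights(weights)        # (n_outputs, d)
        A[:, j] = C.sum(axis=0)                  # total influence of each input on NN j
    A[np.arange(d), np.arange(d)] = 0.0          # x_j is masked out of its own network
    return A

# Tiny example: d = 3 networks, each a 2-layer MLP with 4 hidden units and 1 output.
rng = np.random.default_rng(0)
d, hidden = 3, 4
phi = [[rng.normal(size=(hidden, d)), rng.normal(size=(1, hidden))] for _ in range(d)]

A_phi = adjacency_phi(phi)
penalty = np.trace(expm(A_phi)) - d              # same acyclicity term as in NOTEARS
```

In training, this penalty term is added to the log-likelihood exactly as in the augmented Lagrangian above, with $\alpha_t$ and $\mu_t$ gradually increased until the learned networks define an acyclic graph.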
