

SLIDE 1

Gradient-Based Neural DAG Learning for Causal Discovery

Sébastien Lachapelle1, Philippe Brouillard1, Tristan Deleu1, Simon Lacoste-Julien1,2

1Mila, Université de Montréal 2Canada CIFAR AI Chair

September 6th, 2019


SLIDE 2

Causal graphical model (CGM)

Random vector $X \in \mathbb{R}^d$ ($d$ variables). Let $G$ be a directed acyclic graph (DAG). Assume

$p(x) = \prod_{i=1}^{d} p(x_i \mid x_{\pi_i^G})$

where $\pi_i^G$ is the set of parents of node $i$ in $G$. The factorization encodes statistical independences. A CGM is almost identical to a Bayesian network... except the arrows are given a causal meaning.

Example with $d = 3$:

$p(x) = p(x_1 \mid x_2)\, p(x_2)\, p(x_3 \mid x_1, x_2)$

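To make the factorization concrete, here is a minimal Python sketch (my own illustration, not from the talk) that evaluates the d = 3 joint density above, assuming made-up linear-Gaussian conditionals:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical linear-Gaussian conditionals for the d = 3 example:
    # p(x) = p(x1 | x2) p(x2) p(x3 | x1, x2)
    def joint_density(x1, x2, x3):
        p_x2 = norm.pdf(x2)                                     # x2 has no parents
        p_x1_given_x2 = norm.pdf(x1, loc=0.8 * x2)              # parent: x2
        p_x3_given_x12 = norm.pdf(x3, loc=0.5 * x1 - 0.3 * x2)  # parents: x1, x2
        return p_x1_given_x2 * p_x2 * p_x3_given_x12

    print(joint_density(0.1, -0.4, 1.2))

The coefficients 0.8, 0.5, -0.3 are arbitrary; the point is only that the joint density is a product of one conditional per node.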

SLIDE 3

Structure Learning

                X1      X2      X3
    sample 1    1.76    10.46   0.002
    sample 2    3.42    78.6    0.011
    ...         ...     ...     ...
    sample n    4.56    9.35    1.96


SLIDE 4

Structure Learning

Score-based algorithms:

$\hat{G} = \arg\max_{G \in \mathrm{DAG}} \mathrm{Score}(G)$

Often, $\mathrm{Score}(G)$ = regularized maximum likelihood under $G$.

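To illustrate what such a score might look like (a toy sketch under a linear-Gaussian assumption, not the paper's scoring function), here is a BIC-style regularized log-likelihood for a candidate graph given as parent sets:

    import numpy as np

    def bic_score(X, parents):
        # BIC-style score: Gaussian log-likelihood of each variable given its
        # parents (fit by least squares), minus a penalty per edge.
        n, d = X.shape
        score = 0.0
        for i in range(d):
            pa = parents[i]
            if pa:
                coef, *_ = np.linalg.lstsq(X[:, pa], X[:, i], rcond=None)
                resid = X[:, i] - X[:, pa] @ coef
            else:
                resid = X[:, i] - X[:, i].mean()
            score += -0.5 * n * np.log(resid.var() + 1e-12) - 0.5 * np.log(n) * len(pa)
        return score

    # Example: score the d = 3 graph from earlier (X2 -> X1, {X1, X2} -> X3).
    X = np.random.randn(100, 3)
    print(bic_score(X, parents=[[1], [], [0, 1]]))

The hard part is the arg max itself: the number of DAGs grows super-exponentially with d, which is what motivates the continuous formulations that follow.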

SLIDE 5

Structure Learning

Taxonomy of score-based algorithms (non-exhaustive):

                Discrete optim.                Continuous optim.
    Linear      GES [Chickering, 2003]         NOTEARS [Zheng et al., 2018]
    Nonlinear   CAM [Bühlmann et al., 2014]    GraN-DAG [our contribution]


SLIDE 6

NOTEARS: Continuous optimization for structure learning

Encode the graph as a weighted adjacency matrix $U = [u_1 \mid \cdots \mid u_d] \in \mathbb{R}^{d \times d}$.

[Figure: a 3-node example graph with its binary adjacency matrix $A$ (ones marking the edges) and the corresponding weighted adjacency matrix $U$ (weights 4.8, 0.2, and -1.7 on those edges).]


SLIDE 7

NOTEARS: Continuous optimization for structure learning

$U$ represents the coefficients of a linear model:

$X_i := u_i^\top X + \mathrm{noise}_i \quad \forall i$


SLIDE 8

NOTEARS: Continuous optimization for structure learning

For an arbitrary $U$, the associated graph might be cyclic, so we need an acyclicity constraint. NOTEARS [Zheng et al., 2018] uses this differentiable one:

$\operatorname{Tr} e^{U \odot U} - d = 0, \qquad e^M \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}$
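The constraint is easy to sketch in Python (my own illustration; SciPy's expm computes the matrix exponential):

    import numpy as np
    from scipy.linalg import expm

    def notears_constraint(U):
        # h(U) = Tr e^(U o U) - d, which is zero iff the graph of U is acyclic
        d = U.shape[0]
        return np.trace(expm(U * U)) - d

    U_cyclic = np.array([[0.0, 1.0],
                         [1.0, 0.0]])    # 2-cycle: X1 -> X2 -> X1
    U_dag = np.array([[0.0, 1.0],
                      [0.0, 0.0]])       # single edge: X1 -> X2
    print(notears_constraint(U_cyclic))  # ~1.086: constraint violated
    print(notears_constraint(U_dag))     # 0.0: acyclic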

SLIDE 9

NOTEARS: Continuous optimization for structure learning

NOTEARS [Zheng et al., 2018] solves this continuous constrained optimization problem:

$\max_U \; \underbrace{-\|X - XU\|_F^2 - \lambda \|U\|_1}_{\text{Score}} \quad \text{s.t.} \quad \operatorname{Tr} e^{U \odot U} - d = 0$

where $X \in \mathbb{R}^{n \times d}$ is the design matrix containing all $n$ samples.


SLIDE 10

NOTEARS: Continuous optimization for structure learning

Solve approximately using an augmented Lagrangian method. This amounts to maximizing (with gradient ascent)

$-\|X - XU\|_F^2 - \lambda \|U\|_1 - \alpha_t \left(\operatorname{Tr} e^{U \odot U} - d\right) - \frac{\mu_t}{2} \left(\operatorname{Tr} e^{U \odot U} - d\right)^2$

while gradually increasing $\alpha_t$ and $\mu_t$.

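A bare-bones sketch of the inner gradient-ascent loop for fixed (alpha_t, mu_t) (my own simplification: a plain sign subgradient for the l1 term, a fixed step size, and no outer-loop schedule):

    import numpy as np
    from scipy.linalg import expm

    def inner_step(X, U, lam, alpha, mu, lr=1e-3, iters=200):
        # Gradient ascent on the penalized NOTEARS objective for fixed (alpha, mu).
        n, d = X.shape
        for _ in range(iters):
            E = expm(U * U)                        # e^(U o U)
            h = np.trace(E) - d                    # constraint violation
            grad_h = E.T * (2.0 * U)               # gradient of Tr e^(U o U)
            grad = (2.0 * X.T @ (X - X @ U)        # gradient of -||X - XU||_F^2
                    - lam * np.sign(U)             # subgradient of -lam * ||U||_1
                    - (alpha + mu * h) * grad_h)   # gradients of the penalty terms
            U = U + lr * grad
        return U

    # Outer loop (not shown): call inner_step repeatedly while increasing alpha_t
    # and mu_t until the violation h is ~0, then threshold small entries of U.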

SLIDE 11

NOTEARS: The acyclicity constraint

$\operatorname{Tr} e^{U \odot U} - d = 0, \qquad e^M \triangleq \sum_{k=0}^{\infty} \frac{M^k}{k!}$

Suppose $A \in \{0, 1\}^{d \times d}$ is the adjacency matrix of some directed graph.


SLIDE 12

NOTEARS: The acyclicity constraint

$(A^k)_{ii}$ = number of length-$k$ cycles (closed walks) through node $i$.


SLIDE 13

NOTEARS: The acyclicity constraint

Graph acyclic $\iff (A^k)_{ii} = 0$ for all $i$ and all $k \geq 1$.


SLIDE 14

NOTEARS: The acyclicity constraint

$\iff \operatorname{Tr} \sum_{k=1}^{\infty} \frac{A^k}{k!} = 0$

(every term is nonnegative, so the trace of the sum vanishes iff every term does)


SLIDE 15

NOTEARS: The acyclicity constraint

$\iff \operatorname{Tr} \left( \sum_{k=0}^{\infty} \frac{A^k}{k!} - A^0 \right) = 0$


SLIDE 16

NOTEARS: The acyclicity constraint

$\iff \operatorname{Tr} e^{A} - d = 0$

(since $A^0 = I$ and $\operatorname{Tr} I = d$)


SLIDE 17

NOTEARS: The acyclicity constraint

The argument is almost identical when using the weighted adjacency matrix $U$ instead of $A$: the entries of $U \odot U$ are nonnegative, which is all the cycle-counting argument needs...

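A quick numerical check of the cycle-counting fact (my own example: a single directed 3-cycle):

    import numpy as np
    from scipy.linalg import expm

    # Directed 3-cycle: X1 -> X2 -> X3 -> X1
    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])

    for k in range(1, 5):
        diag = np.diag(np.linalg.matrix_power(A, k))
        print(k, diag)  # nonzero only at k = 3: the lone 3-cycle through each node

    print(np.trace(expm(A)) - 3)  # > 0, so the constraint flags the cycle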

SLIDE 18

Gradient-Based Neural DAG Learning: can we go nonlinear?


SLIDE 19

Gradient-Based Neural DAG Learning

Each conditional $p(x_i \mid x_{-i}; \theta_{(i)})$ is parameterized by its own neural network:

$\phi_{(i)} \triangleq \{W^{(1)}_{(i)}, \ldots, W^{(L+1)}_{(i)}\}$, where $W^{(\ell)}_{(i)}$ is the $\ell$th weight matrix of $\mathrm{NN}_{\phi_{(i)}}$, and $\phi \triangleq \{\phi_{(i)}\}_{i=1}^{d}$.

SLIDE 20

Gradient-Based Neural DAG Learning

But the product $\prod_{i=1}^{d} p(x_i \mid x_{-i}; \theta_{(i)})$ does not decompose according to a DAG!

We need to constrain the networks to be acyclic! How?

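To fix ideas, a minimal PyTorch sketch of one network per conditional (my own illustration, not the paper's architecture; here each NN_phi_(i) outputs a mean and log-variance theta_(i) for a Gaussian conditional, and masks out its own input):

    import torch
    import torch.nn as nn

    class NodeMLP(nn.Module):
        # NN_phi_(i): maps x (with x_i masked out) to theta_(i) = (mean, log-variance).
        def __init__(self, d, i, hidden=16):
            super().__init__()
            mask = torch.ones(d)
            mask[i] = 0.0                      # x_i must not predict itself
            self.register_buffer("mask", mask)
            self.net = nn.Sequential(
                nn.Linear(d, hidden), nn.LeakyReLU(),
                nn.Linear(hidden, 2),
            )

        def forward(self, x):
            return self.net(x * self.mask)

    d = 3
    nets = [NodeMLP(d, i) for i in range(d)]   # phi = {phi_(1), ..., phi_(d)}

On its own, nothing stops NN_phi_(1) from using X2 while NN_phi_(2) uses X1, which is exactly the cyclic situation the next slides rule out.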

SLIDE 21

Gradient-Based Neural DAG Learning

Key idea: construct a weighted adjacency matrix $A_\phi$ (analogous to $U$ from the linear case) that can be used in the same acyclicity constraint. Then maximize the likelihood under the acyclicity constraint via an augmented Lagrangian:

$\max_\phi \; \sum_{i=1}^{d} \log p_\phi(x_i \mid x_{-i}) - \alpha_t \left(\operatorname{Tr} e^{A_\phi} - d\right) - \frac{\mu_t}{2} \left(\operatorname{Tr} e^{A_\phi} - d\right)^2$


SLIDE 22

Constructing weighted adjacency matrix Aφ

Let's measure the "strength" of the edge $X_j \to X_i$.


SLIDE 23

Constructing weighted adjacency matrix Aφ

Path product: $|W^{(1)}_{h_1 j}|\,|W^{(2)}_{h_2 h_1}|\,|W^{(3)}_{k h_2}| \geq 0$


SLIDE 24

Constructing weighted adjacency matrix Aφ

$C \triangleq |W^{(3)}|\,|W^{(2)}|\,|W^{(1)}|$ (entrywise absolute values)

"Connection strength" from $X_j$ to $\theta_{(i)}$: $\sum_{k=1}^{m} C_{kj} \geq 0$


SLIDE 25

Constructing weighted adjacency matrix Aφ

$\sum_{k=1}^{m} C_{kj} = 0 \;\Rightarrow\;$ all paths from $X_j$ to $X_i$ are inactive!


SLIDE 26

Constructing weighted adjacency matrix Aφ

Collecting these connection strengths (and zeroing the diagonal) gives

$(A_\phi)_{ji} \triangleq \begin{cases} \sum_{k=1}^{m} C^{(i)}_{kj}, & \text{if } i \neq j \\ 0, & \text{otherwise} \end{cases}$

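Continuing the PyTorch sketch from slide 20 (reusing its imports, NodeMLP, nets, and d, all of which are my own hypothetical names), one possible way to assemble A_phi:

    def weighted_adjacency(nets, d):
        # (A_phi)[j, i] = sum_k C^(i)_kj, where C^(i) is the product of the
        # entrywise absolute values of NN_phi_(i)'s weight matrices.
        cols = []
        for net in nets:
            linears = [m for m in net.net if isinstance(m, nn.Linear)]
            C = linears[0].weight.abs() * net.mask   # mask zeroes X_i's own column
            for lin in linears[1:]:
                C = lin.weight.abs() @ C
            cols.append(C.sum(dim=0))                # strength of each X_j -> X_i
        return torch.stack(cols, dim=1)

    A_phi = weighted_adjacency(nets, d)
    # A_phi now plays the role U played in NOTEARS: plug it into Tr e^(A_phi) - d.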

SLIDE 27

Gradient-Based Neural DAG Learning

[Figure: entries of $A_\phi$ for correct edges vs. wrong edges.]


SLIDE 28

Experiments

Synthetic data: $X_i \mid X_{\pi_i^G} \sim \mathcal{N}\!\left(f_i(X_{\pi_i^G}), \sigma_i^2\right)$, with $f_i$ sampled from a Gaussian process.

Real data: measurements of expression levels of proteins and phospholipids in human immune system cells [Sachs et al., 2005].

Results (SHD and SID; lower is better):

                              Synthetic (50 nodes)          Protein data set
                              SHD            SID            SHD    SID
    Continuous  GraN-DAG      102.6±21.2     1060.1±109.4   13     47
                DAG-GNN       191.9±15.2     2146.2±64      16     44
                NOTEARS       202.3±14.3     2149.1±76.3    21     44
    Discrete    CAM           98.8±20.7      1197.2±125.9   12     55
                RANDOM        708.4±234.4    1921.3±203.5   21     60

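For reference, a sketch of the SHD metric under one common convention (a reversed edge counts as a single error); this is my own implementation, not the paper's evaluation code:

    import numpy as np

    def shd(A_true, A_pred):
        # Structural Hamming distance between binary adjacency matrices:
        # the number of edge additions, deletions, and reversals separating them.
        diff = np.abs(A_true - A_pred)
        # A reversed edge is wrong in both directions; count the pair once.
        reversed_pairs = int(((diff + diff.T) == 2).sum()) // 2
        return int(diff.sum()) - reversed_pairs

    A_true = np.array([[0, 1], [0, 0]])
    A_pred = np.array([[0, 0], [1, 0]])  # same edge, flipped
    print(shd(A_true, A_pred))           # 1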

SLIDE 29

Conclusion and future work

Contributions:
- We proposed a new characterization of acyclicity for neural networks.
- GraN-DAG is the first nonlinear continuous approach shown to be competitive with state-of-the-art nonlinear discrete approaches.

Future work:
- Working with interventional data.
- DAGs appear in many places; could we adapt the neural acyclicity constraint to other problems (beyond causality)?


SLIDE 30

References

Bühlmann, P., Peters, J., & Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics.

Chickering, D. (2003). Optimal structure identification with greedy search. Journal of Machine Learning Research.

Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D., & Nolan, G. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science.

Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems 31.
