SLIDE 1 Causal Theories: A Categorical Approach to Bayesian Networks
Brendan Fong, University of Oxford
PSU Applied Algebra and Network Theory Seminar. 22 April 2015
SLIDE 2 Task: formalise how Bayesian networks guide our reasoning about a collection of random variables.
SLIDE 3 Test: Simpson’s paradox
What should we conclude from the following data?

         t,b   ¬t,b   t,¬b   ¬t,¬b
  r      30    40     9      2
  ¬r     10    40     51     18

         t     ¬t
  r      39    42
  ¬r     61    58

(Columns are indexed by the values of T and B, rows by R.)
SLIDE 5
The big picture
Network-style diagrammatic languages have been developed to represent and reason about many different sciences. Why? Can we formalise their key features and relationships? Can we unify them? I work with John Baez and his team to understand these questions from a category-theoretic viewpoint.
SLIDE 7 Outline
- 1. Categories
- 2. Lawvere theories
- 3. Bayesian networks
- 4. Causal theories
- 5. An application to Simpson’s paradox
SLIDE 8
Categories
○ Categories are a great algebraic framework for discussing interconnection, or composition. They first arose in the 1940s in algebraic topology. ○ From the 1980s, it became clear they had a role to play in formalising uses of string/network diagrams, such as Feynman diagrams.
SLIDE 11 Categories
○ A category C is the structure of one-dimensional flow charts. ○ It comprises objects, or types, X, Y, etc., together with morphisms between these types, f∶X → Y. ○ We can compose morphisms of matching types to get new morphisms: for example, f∶X → Y, g∶Y → Z, and h∶Z → W compose to a morphism X → W. The composition rule must make such pictures unambiguously describe a morphism.
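As a concrete sketch, Python types and functions behave like such a category (the names below are illustrative): composition is ordinary function composition, and associativity is what makes a chain of boxes describe one unambiguous morphism.

```python
def compose(g, f):
    """Sequential composition of morphisms: (g ∘ f)(x) = g(f(x))."""
    return lambda x: g(f(x))

# Morphisms f: X -> Y, g: Y -> Z, h: Z -> W (here all types are int)
f = lambda x: x + 1
g = lambda y: 2 * y
h = lambda z: z - 3

# Associativity: h ∘ (g ∘ f) = (h ∘ g) ∘ f, so the flow chart
# X -f-> Y -g-> Z -h-> W describes a single morphism X -> W.
left = compose(h, compose(g, f))
right = compose(compose(h, g), f)
assert all(left(x) == right(x) for x in range(100))
print(left(5))  # 2 * (5 + 1) - 3 = 9
```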
SLIDE 14 Categories
○ There are various types of categories that allow extra structure. ○ A monoidal category (C,⊗) is the structure of two-dimensional flow charts: besides sequential composition, morphisms f∶X → V and g∶Y → W can be placed side by side to give f ⊗ g∶X ⊗ Y → V ⊗ W. The key point is that we have a notion of ‘parallel’ or tensor composition. ○ A symmetric monoidal category further allows you to cross wires: there is a swap morphism X ⊗ Y → Y ⊗ X.
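A minimal sketch of this structure, modelling (Set, ×) with Python pairs (the helper names tensor and swap are my own):

```python
def tensor(f, g):
    """Parallel (tensor) composition: (f ⊗ g)(x, y) = (f(x), g(y))."""
    return lambda pair: (f(pair[0]), g(pair[1]))

def swap(pair):
    """The symmetry: cross the wires, X ⊗ Y -> Y ⊗ X."""
    return (pair[1], pair[0])

f = lambda x: x + 1      # f: X -> V
g = lambda y: 2 * y      # g: Y -> W
fg = tensor(f, g)        # f ⊗ g: X ⊗ Y -> V ⊗ W
print(fg((3, 4)))        # (4, 8)
print(swap(fg((3, 4))))  # (8, 4)
```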
SLIDE 17 Categories
○ A functor F∶C → D is a map between categories. ○ It turns morphisms f∶X → Y in C into morphisms Ff∶FX → FY in D. This assignment must preserve composition: F(g ∘ f) = Fg ∘ Ff, so that the diagram FX → FY → FZ unambiguously describes a morphism. ○ A monoidal functor F∶(C,×) → (D,⊠) also preserves the tensor.
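As an illustrative sketch, the list construction gives a functor from Python types and functions to itself; functoriality is exactly the preservation-of-composition condition.

```python
def compose(g, f):
    """Sequential composition: (g ∘ f)(x) = g(f(x))."""
    return lambda x: g(f(x))

def fmap(f):
    """The list functor F: F(X) = lists of X, F(f) = apply f elementwise."""
    return lambda xs: [f(x) for x in xs]

f = lambda x: x + 1
g = lambda y: 2 * y

xs = [1, 2, 3]
# Preservation of composition: F(g ∘ f) = F(g) ∘ F(f)
assert fmap(compose(g, f))(xs) == compose(fmap(g), fmap(f))(xs)
print(fmap(compose(g, f))(xs))  # [4, 6, 8]
```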
SLIDE 20
Lawvere theories
○ Lawvere theories were developed by William Lawvere in his 1963 doctoral thesis as a categorical approach to universal algebra. ○ Let S be a set; we call the elements of this set sorts. A (multisorted) Lawvere theory T is a category with finite products such that every object is a finite product of sorts. ○ Each Lawvere theory T is a monoidal category (T,×) with tensor the product ×. ○ If (C,⊗) is a monoidal category, and T is a Lawvere theory, a model of T in (C,⊗) is a monoidal functor F∶(T,×) → (C,⊗).
SLIDE 24 Lawvere theories
Example: the theory of groups
○ Let TG have objects X^n for n ∈ N, and morphisms generated by µ∶X × X → X, η∶1 → X, and ι∶X → X, subject to the equations (drawn as string diagrams on the slide):
µ ∘ (µ × 1) = µ ∘ (1 × µ) (Assoc)
µ ∘ (η × 1) = 1 = µ ∘ (1 × η) (Id)
µ ∘ (ι × 1) ∘ Δ = η ∘ ! = µ ∘ (1 × ι) ∘ Δ (Inv)
where Δ∶X → X × X is the diagonal and !∶X → 1 the unique map.
○ A model in (Set,×) is a group. ○ A model in (Top,×) is a topological group.
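A model of TG in (Set, ×) can be checked concretely; here is a sketch interpreting the generators as arithmetic mod 5, with the axioms verified by exhaustive (finite) checks:

```python
n = 5
mu = lambda a, b: (a + b) % n    # interprets µ: X × X -> X
eta = 0                          # interprets η: 1 -> X (the unit element)
iota = lambda a: (-a) % n        # interprets ι: X -> X (inverses)

elems = range(n)
# (Assoc)
assert all(mu(mu(a, b), c) == mu(a, mu(b, c))
           for a in elems for b in elems for c in elems)
# (Id)
assert all(mu(eta, a) == a == mu(a, eta) for a in elems)
# (Inv)
assert all(mu(a, iota(a)) == eta == mu(iota(a), a) for a in elems)
print("Z/5Z is a model of the theory of groups in (Set, ×)")
```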
SLIDE 27 Outline
- 1. Categories
- 2. Lawvere theories
- 3. Bayesian networks
- 4. Causal theories
- 5. An application to Simpson’s paradox
SLIDE 28 Bayesian networks
○ Bayesian networks represent conditional dependencies between random variables using directed acyclic graphs (DAGs). They were developed by Judea Pearl in the 1980s. ○ In a DAG, we say that a vertex u is a parent of a vertex v if there exists an edge from u to v. We write pa(v) for the set of parents of v. ○ We say that u is a descendant of v if there exists a directed path from v to u. We write de(v) for the set of descendants of v. ○ A Bayesian network is a DAG G together with a set of random variables indexed by the vertices V of G, such that each variable is conditionally independent of its non-descendants given its parents. That is, for all v ∈ V,
P(Xv, XV∖de(v) ∣ Xpa(v)) = P(Xv ∣ Xpa(v)) P(XV∖de(v) ∣ Xpa(v))
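To make the condition concrete, here is a small sketch (the probability values are made up) for a network on the chain T → B → R: the joint distribution factors through the parents, and R then comes out conditionally independent of its non-descendant T given its parent B.

```python
from itertools import product

# A made-up Bayesian network on the chain T -> B -> R:
# the joint factors as P(t,b,r) = P(t) P(b|t) P(r|b).
P_T = {0: 0.5, 1: 0.5}
P_B_given_T = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}  # indexed [t][b]
P_R_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # indexed [b][r]

joint = {(t, b, r): P_T[t] * P_B_given_T[t][b] * P_R_given_B[b][r]
         for t, b, r in product((0, 1), repeat=3)}

# The defining condition: given its parent B, the variable R is
# conditionally independent of its non-descendant T.
for (t, b, r), p in joint.items():
    p_r_given_tb = p / sum(joint[(t, b, rr)] for rr in (0, 1))
    assert abs(p_r_given_tb - P_R_given_B[b][r]) < 1e-12
print("R is conditionally independent of T given B")
```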
SLIDE 32
Bayesian networks
○ For example, suppose we have a Bayesian network on a DAG in which the season influences the road conditions and the average speed, and these in turn influence the danger level. ○ Then if we know the road conditions and the average speed, knowing the season tells us no more about the danger level, and vice versa. ○ This matches our intuitions about causal reasoning.
SLIDE 35
Causal theories
○ Given a DAG G with vertices V, the causal theory CG corresponding to it is the V-sorted Lawvere theory freely generated by morphisms [v∣pa(v)]∶pa(v) → v for all v ∈ V. ○ For example, the causal theory corresponding to the DAG with edges A → C ← B has as objects words in A, B, C and morphisms generated by [A]∶1 → A, [B]∶1 → B, and [C∣AB]∶A × B → C. ○ For all w, z ⊆ V, there exists a morphism [w∣∣z]∶z → w in CG, known as the causal conditional, that expresses the causal influence of z on w.
SLIDE 38
Causal theories
○ We are particularly interested in models of causal theories in the category Stoch of finite sets and stochastic matrices. ○ Its objects are finite sets. ○ Given finite sets X and Y, a morphism X → Y is a ∣Y∣ × ∣X∣ matrix with entries in [0,1] such that each column sums to 1. For example, if X has 4 elements and Y has 2 elements,
( 0.75 0.15 0.5 0.1 )
( 0.25 0.85 0.5 0.9 )
is a morphism X → Y. ○ Composition is by matrix multiplication, tensor by Cartesian product of sets and tensor product of matrices.
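A quick sketch in plain Python (helper names are mine): composing stochastic matrices by matrix multiplication, and tensoring by Kronecker product, yields stochastic matrices again, so both operations stay inside Stoch.

```python
def matmul(A, B):
    """Composition in Stoch: ordinary matrix multiplication."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    """Tensor in Stoch: the Kronecker product of matrices."""
    return [[a * b for a in rowA for b in rowB]
            for rowA in A for rowB in B]

def is_stochastic(M):
    """Entries in [0,1], with every column summing to 1."""
    return (all(0 <= x <= 1 for row in M for x in row) and
            all(abs(sum(M[i][j] for i in range(len(M))) - 1) < 1e-9
                for j in range(len(M[0]))))

R_TB = [[0.75, 0.15, 0.5, 0.1],   # the 2x4 morphism from the slide
        [0.25, 0.85, 0.5, 0.9]]
f = [[0.3, 0.7],                  # some 2x2 stochastic matrix
     [0.7, 0.3]]
print(is_stochastic(matmul(f, R_TB)))  # True
print(is_stochastic(kron(f, R_TB)))    # True
```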
SLIDE 43
Causal theories
Recall that a random variable X induces a probability distribution on its set of possible values. We call two random variables equivalent if they induce the same distribution, and two Bayesian networks equivalent if their induced joint distributions are equivalent.
SLIDE 44 Causal theories
Theorem. Let G be a DAG. The models of the causal theory CG in Stoch are in one-to-one correspondence with equivalence classes of Bayesian networks on G such that the random variables corresponding to its vertices are finite and of full support.
SLIDE 45 Outline
- 1. Categories
- 2. Lawvere theories
- 3. Bayesian networks
- 4. Causal theories
- 5. An application to Simpson’s paradox
SLIDE 46
Simpson’s paradox
Suppose we trial a treatment with the variables:
T: the patient receives treatment for a heart condition.
R: the patient has recovered at the conclusion of treatment.
B: the patient has healthy blood pressure at post-treatment checkup.
What should we conclude from the following data?

         t,b   ¬t,b   t,¬b   ¬t,¬b
  r      30    40     9      2
  ¬r     10    40     51     18

(Columns are indexed by the values of T and B, rows by R.)
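The aggregated table can be recomputed from the stratified counts; a short sketch:

```python
# Counts from the trial, stratified by blood pressure B.
counts = {('t', 'b'):    {'r': 30, '¬r': 10},
          ('¬t', 'b'):   {'r': 40, '¬r': 40},
          ('t', '¬b'):   {'r': 9,  '¬r': 51},
          ('¬t', '¬b'):  {'r': 2,  '¬r': 18}}

# Summing over B recovers the aggregate table.
marginal = {}
for (t, b), row in counts.items():
    for r, m in row.items():
        marginal[(t, r)] = marginal.get((t, r), 0) + m

print(marginal[('t', 'r')], marginal[('¬t', 'r')])  # 39 42
# Within each B stratum the treated group does better (30/40 vs 40/80,
# and 9/60 vs 2/20), yet in the aggregate it does worse (39/100 vs
# 42/100): Simpson's paradox.
```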
SLIDE 47
Simpson’s paradox
Let’s look at their causal theories. There are two causal structures we might consider: in the first, treatment influences blood pressure (edges T → B, T → R, and B → R); in the second, blood pressure is independent of treatment (edges T → R and B → R only). The morphisms in their corresponding causal theories lead us to two different conclusions.
SLIDE 50
Simpson’s paradox
In the first case we have the causal theory and model in Stoch given by the DAG with edges T → B, T → R, B → R, and

T = ( 0.5 )
    ( 0.5 )

B∣T = ( 0.4 0.8 )
      ( 0.6 0.2 )

R∣TB = ( 0.75 0.15 0.5 0.1 )
       ( 0.25 0.85 0.5 0.9 )
SLIDE 51
Simpson’s paradox
This leads us to reason via the morphism P[R∣∣T]∶T → R, the composite R∣TB ∘ (1_T × B∣T) ∘ Δ_T, which as a matrix is

( 0.39 0.42 )
( 0.61 0.58 )
SLIDE 52
Simpson’s paradox
In the second case we have the causal theory and model in Stoch given by the DAG with edges T → R and B → R, and

T = ( 0.5 )
    ( 0.5 )

B = ( 0.6 )
    ( 0.4 )

R∣TB = ( 0.75 0.15 0.5 0.1 )
       ( 0.25 0.85 0.5 0.9 )
SLIDE 53
Simpson’s paradox
We thus reason by the morphism Q[R∣∣T]∶T → R, the composite R∣TB ∘ (1_T × B), which as a matrix is

( 0.51 0.34 )
( 0.49 0.66 )
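Both causal conditionals reduce to the sums P(r∣t) = Σ_b P(r∣t,b) P(b∣t) and Q(r∣t) = Σ_b P(r∣t,b) P(b); a sketch, taking the column order of R∣TB to be (t,b), (t,¬b), (¬t,b), (¬t,¬b), which matches the trial data:

```python
R_TB = [[0.75, 0.15, 0.5, 0.1],   # P(r|t,b): columns (t,b),(t,¬b),(¬t,b),(¬t,¬b)
        [0.25, 0.85, 0.5, 0.9]]
B_T = [[0.4, 0.8],                # P(b|t), P(b|¬t)
       [0.6, 0.2]]                # P(¬b|t), P(¬b|¬t)
B = [0.6, 0.4]                    # P(b), P(¬b)

def conditional(w):
    """R||T as a 2x2 matrix, where w[b][t] is the weight of b given t."""
    return [[sum(R_TB[r][2 * t + b] * w[b][t] for b in (0, 1))
             for t in (0, 1)] for r in (0, 1)]

P = conditional(B_T)                       # first causal structure
Q = conditional([[B[0]] * 2, [B[1]] * 2])  # second: B independent of T
print([[round(x, 2) for x in row] for row in P])  # [[0.39, 0.42], [0.61, 0.58]]
print([[round(x, 2) for x in row] for row in Q])  # [[0.51, 0.34], [0.49, 0.66]]
```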
SLIDE 54 Simpson’s paradox
Thus if the treatment acts by affecting blood pressure, it is ineffective: P[R∣∣T] gives a recovery rate of 0.39 with treatment against 0.42 without. On the other hand, if it is the blood pressure affecting the efficacy of the treatment, then the treatment is effective: Q[R∣∣T] gives 0.51 against 0.34.
SLIDE 55 Thanks for listening. More details: arXiv:1301.6201. Slides will be available at http://www.cs.ox.ac.uk/people/brendan.fong/. This is part of John Baez’s networks programme; see http://math.ucr.edu/baez/networks/.