Causal Theories: A Categorical Approach to Bayesian Networks


slide-1
SLIDE 1

Causal Theories: A Categorical Approach to Bayesian Networks

Brendan Fong, University of Oxford

PSU Applied Algebra and Network Theory Seminar. 22 April 2015

slide-2
SLIDE 2

Task: formalise how Bayesian networks guide our reasoning about a collection of random variables.

slide-3
SLIDE 3

Test: Simpson’s paradox

What should we conclude from the following data?

Counts of R by (T, B):

          t,b   ¬t,b   t,¬b   ¬t,¬b
  r        30     40      9       2
  ¬r       10     40     51      18

Counts of R by T alone:

          t    ¬t
  r      39    42
  ¬r     61    58
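The relationship between the two tables can be checked numerically. The following is a minimal sketch, assuming the column order shown in the first table; the array layout and the names `counts`, `by_treatment`, `rate_stratified` and `rate_aggregate` are illustrative choices, not part of the talk.

```python
import numpy as np

# Counts from the first table. Rows: r, ¬r. Columns: (t,b), (¬t,b), (t,¬b), (¬t,¬b).
counts = np.array([[30, 40, 9, 2],
                   [10, 40, 51, 18]])

# Sum over B: the t column is (t,b) + (t,¬b); the ¬t column is (¬t,b) + (¬t,¬b).
by_treatment = np.stack([counts[:, 0] + counts[:, 2],
                         counts[:, 1] + counts[:, 3]], axis=1)

# Recovery rate within each column of each table.
rate_stratified = counts[0] / counts.sum(axis=0)
rate_aggregate = by_treatment[0] / by_treatment.sum(axis=0)

print(by_treatment)
print(rate_stratified, rate_aggregate)
```

Within each blood-pressure group the treated patients recover more often (0.75 against 0.5, and 0.15 against 0.1), yet in the aggregate table they recover less often (0.39 against 0.42). This reversal is the paradox.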

slide-5
SLIDE 5

The big picture

Network-style diagrammatic languages have been developed for representation and reasoning in many different sciences.
○ Why?
○ Can we formalise their key features and relationships?
○ Can we unify them?
I work with John Baez and his team to understand these questions from a category-theoretic viewpoint.

slide-7
SLIDE 7

Outline

  • 1. Categories
  • 2. Lawvere theories
  • 3. Bayesian networks
  • 4. Causal theories
  • 5. An application to Simpson’s paradox
slide-8
SLIDE 8

Categories

○ Categories are a great algebraic framework for discussing interconnection, or composition. They first arose in the 1940s in algebraic topology.
○ From the 1980s, it became clear they had a role to play in formalising uses of string/network diagrams, such as Feynman diagrams.

slide-11
SLIDE 11

Categories

○ A category C is the structure of one-dimensional flow charts.
○ It comprises objects, or types, X, Y, etc., together with morphisms between these types, such as f∶X → Y.
○ We can compose morphisms of matching types to get new morphisms:

  X --f--> Y --g--> Z --h--> W

The composition rule must make such pictures describe a morphism unambiguously.
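One way to read this last requirement is as associativity: however the chain is bracketed, the composite is the same. Here is a minimal sketch, assuming Python functions play the role of morphisms; the helper `compose` and the particular `f`, `g`, `h` are hypothetical and chosen only for illustration.

```python
from typing import Callable

def compose(g: Callable, f: Callable) -> Callable:
    """Return the composite g ∘ f: apply f first, then g."""
    return lambda x: g(f(x))

# Hypothetical morphisms X -> Y -> Z -> W, with every type taken to be int.
f = lambda x: x + 1
g = lambda y: 2 * y
h = lambda z: z - 3

left = compose(h, compose(g, f))    # h ∘ (g ∘ f)
right = compose(compose(h, g), f)   # (h ∘ g) ∘ f

# Both bracketings agree on every input tried.
print(all(left(x) == right(x) for x in range(-5, 6)))
```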

slide-14
SLIDE 14

Categories

○ There are various types of categories that allow extra operations.
○ A monoidal category (C,⊗) is the structure of two-dimensional flow charts.
[Diagram: a two-dimensional flow chart built from morphisms f, g, h, k, with wires labelled V, W, X, Y, Z]
The key point is that we have a notion of ‘parallel’ or tensor composition.
○ A symmetric monoidal category further allows you to cross wires:
[Diagram: wires X and Y crossing]
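Parallel composition and wire crossing can be sketched in the same informal style. The example below assumes functions on pairs stand in for morphisms of a symmetric monoidal category; `compose`, `tensor`, `swap` and the sample functions are hypothetical names introduced here.

```python
def compose(g, f):
    """Sequential composition g ∘ f."""
    return lambda x: g(f(x))

def tensor(f, g):
    """Parallel composition f ⊗ g, acting on pairs componentwise."""
    return lambda pair: (f(pair[0]), g(pair[1]))

def swap(pair):
    """The symmetry: cross the two wires."""
    return (pair[1], pair[0])

f1, g1 = (lambda x: x + 1), (lambda x: 2 * x)
f2, g2 = (lambda s: s.upper()), (lambda s: s + "!")

# Interchange: composing then tensoring equals tensoring then composing.
lhs = tensor(compose(g1, f1), compose(g2, f2))
rhs = compose(tensor(g1, g2), tensor(f1, f2))
print(lhs((3, "a")), rhs((3, "a")))
```

Both sides send `(3, "a")` to `(8, "A!")`, which is what lets a two-dimensional picture be read row by row or column by column with the same result.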

slide-17
SLIDE 17

Categories

○ A functor F∶C → D is a map between categories.
○ It turns morphisms f∶X → Y in C into morphisms Ff∶FX → FY in D. This assignment must preserve composition. That is, the diagram

  FX --Ff--> FY --Fg--> FZ

must be unambiguous: F(g ∘ f) = Fg ∘ Ff.
○ A monoidal functor F∶(C,×) → (D,⊠) also preserves the tensor.
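As an illustration of preserving composition, here is a sketch of one familiar functor: sending a function between finite sets to its 0/1 matrix. Encoding a finite set of size n as `range(n)`, and the name `to_matrix`, are assumptions made for this example only.

```python
import numpy as np

def to_matrix(func, n_in: int, n_out: int) -> np.ndarray:
    """Send a function {0..n_in-1} -> {0..n_out-1} to an n_out x n_in 0/1 matrix."""
    m = np.zeros((n_out, n_in))
    for x in range(n_in):
        m[func(x), x] = 1.0
    return m

f = lambda x: x % 2    # f: {0,1,2,3} -> {0,1}
g = lambda y: 2 - y    # g: {0,1} -> {0,1,2}

F_f = to_matrix(f, 4, 2)
F_g = to_matrix(g, 2, 3)
F_gf = to_matrix(lambda x: g(f(x)), 4, 3)   # F(g ∘ f)

# Composition of functions becomes matrix multiplication.
print(np.array_equal(F_gf, F_g @ F_f))
```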

slide-20
SLIDE 20

Lawvere theories

○ Lawvere theories were developed by William Lawvere in his 1963 doctoral thesis as a categorical approach to universal algebra.
○ Let S be a set; we call the elements of this set sorts. A (multisorted) Lawvere theory T is a category with finite products such that every object is a finite product of sorts.
○ Each Lawvere theory T is a monoidal category (T,×) with tensor the product ×.
○ If (C,⊗) is a monoidal category, and T is a Lawvere theory, a model of T in (C,⊗) is a monoidal functor F∶(T,×) → (C,⊗).

slide-24
SLIDE 24

Lawvere theories

Example: the theory of groups

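○ Let T_G have objects X^n for n ∈ ℕ, and morphisms generated by µ∶X × X → X, η∶1 → X, and ι∶X → X subject to:

  (Assoc)  µ ∘ (µ × id) = µ ∘ (id × µ)
  (Id)     µ ∘ (η × id) = id = µ ∘ (id × η)
  (Inv)    µ ∘ (ι × id) ∘ Δ = η ∘ ! = µ ∘ (id × ι) ∘ Δ

where Δ∶X → X × X is the diagonal and !∶X → 1 is the unique map to the terminal object.
○ A model in (Set,×) is a group.
○ A model in (Top,×) is a topological group.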
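To make "a model in (Set,×) is a group" concrete, here is a small sketch that checks the three laws for one particular model. The choice of the integers mod 3, and the names `mu`, `eta`, `iota`, are assumptions for illustration.

```python
from itertools import product

X = range(3)                      # underlying set of the model
mu = lambda a, b: (a + b) % 3     # interprets µ: X × X -> X
eta = 0                           # interprets η: 1 -> X (picks out the identity)
iota = lambda a: (-a) % 3         # interprets ι: X -> X

# Check each law on every element (or tuple of elements) of X.
assoc = all(mu(mu(a, b), c) == mu(a, mu(b, c)) for a, b, c in product(X, repeat=3))
ident = all(mu(eta, a) == a == mu(a, eta) for a in X)
inv = all(mu(iota(a), a) == eta == mu(a, iota(a)) for a in X)

print(assoc, ident, inv)
```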

slide-27
SLIDE 27

Outline

  • 1. Categories
  • 2. Lawvere theories
  • 3. Bayesian networks
  • 4. Causal theories
  • 5. An application to Simpson’s paradox
slide-28
SLIDE 28

Bayesian networks

○ Bayesian networks represent conditional dependencies between random variables using directed acyclic graphs (DAGs). They were developed by Judea Pearl in the 1980s.
○ In a DAG, we say that a vertex u is a parent of a vertex v if there exists an edge from u to v. We write pa(v) for the set of parents of v.
○ We say that u is a descendant of v if there exists a directed path from v to u. We write de(v) for the set of descendants of v.
○ A Bayesian network is a DAG G together with a set of random variables indexed by the vertices V of G, such that each variable is conditionally independent of its non-descendants given its parents. That is, for all v ∈ V,

  P(X_v, X_{V∖de(v)} ∣ X_{pa(v)}) = P(X_v ∣ X_{pa(v)}) P(X_{V∖de(v)} ∣ X_{pa(v)})
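The sets pa(v) and de(v) are easy to compute from an edge list. Below is a minimal sketch on a hypothetical three-vertex DAG; the `children` dictionary and both function names are illustrative, and the code assumes a vertex is not counted among its own descendants, a convention the slides do not spell out.

```python
from typing import Dict, Set

# A hypothetical DAG given as vertex -> set of children (edges point parent -> child).
children: Dict[str, Set[str]] = {"A": {"B", "C"}, "B": {"C"}, "C": set()}

def parents(v: str) -> Set[str]:
    """pa(v): all u with an edge u -> v."""
    return {u for u, kids in children.items() if v in kids}

def descendants(v: str) -> Set[str]:
    """de(v): all u reachable from v by a directed path of length at least 1."""
    seen: Set[str] = set()
    stack = list(children[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children[u])
    return seen

print(parents("C"), descendants("A"))
```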

slide-32
SLIDE 32

Bayesian networks

○ For example, suppose we have a Bayesian network on the following DAG:
[Figure: DAG relating season, road conditions, average speed and danger level]
○ Then if we know the road conditions and the average speed, knowing the season tells us no more about the danger level, and vice versa.
○ This matches our intuitions about causal reasoning.

slide-35
SLIDE 35

Causal theories

○ Given a DAG G with vertices V, the causal theory C_G corresponding to it is the V-sorted Lawvere theory freely generated by morphisms [v∣pa(v)]∶pa(v) → v for all v ∈ V.
○ For example, the causal theory corresponding to
[Figure: DAG on vertices A, B, C]
has as objects words in A, B, C and morphisms generated by those labelled A, B and C∣AB.
○ For all w,z ⊆ V, there exists a morphism [w∣∣z]∶z → w in C_G, known as the causal conditional, that expresses the causal influence of z on w.
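The generating morphisms can be read straight off the parent sets. The sketch below lists them for the DAG used in the first Simpson's paradox case later in the talk (T with no parents, B with parent T, R with parents T and B); the dictionary encoding and the function `generators` are hypothetical.

```python
from typing import Dict, List, Tuple

# DAG given as vertex -> ordered list of parents.
pa: Dict[str, List[str]] = {"T": [], "B": ["T"], "R": ["T", "B"]}

def generators(pa: Dict[str, List[str]]) -> List[Tuple[str, Tuple[str, ...], str]]:
    """One generator [v|pa(v)] : pa(v) -> v per vertex, as (label, domain, codomain)."""
    gens = []
    for v, ps in pa.items():
        label = f"[{v}|{''.join(ps)}]" if ps else f"[{v}]"
        gens.append((label, tuple(ps), v))
    return gens

gens = generators(pa)
for label, dom, cod in gens:
    print(label, ":", " ".join(dom) or "1", "->", cod)
```

A vertex with no parents yields a morphism out of the empty word, written 1 in the printed output.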

slide-38
SLIDE 38

Causal theories

○ We are particularly interested in models of causal theories in the category Stoch of finite sets and stochastic matrices.
○ Its objects are finite sets.
○ Given finite sets X and Y, a morphism X → Y is a ∣Y∣ × ∣X∣ matrix with entries in [0,1] such that each column sums to 1. For example, if X has 4 elements and Y has 2 elements, then

  ( 0.75  0.15  0.5  0.1 )
  ( 0.25  0.85  0.5  0.9 )

is a morphism X → Y.
○ Composition is by matrix multiplication; the tensor is the Cartesian product of sets and the tensor product of matrices.
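These operations can be tried out directly with NumPy. The sketch below uses the 2 × 4 example from the slide; the second matrix `g` and the helper `is_stochastic` are assumptions introduced for illustration, and `numpy.kron` is used as the tensor product of matrices.

```python
import numpy as np

def is_stochastic(m: np.ndarray) -> bool:
    """Entries in [0, 1] and every column summing to 1."""
    return bool(np.all(m >= 0) and np.all(m <= 1) and np.allclose(m.sum(axis=0), 1.0))

# The 2 x 4 example from the slide: a morphism X -> Y with |X| = 4, |Y| = 2.
f = np.array([[0.75, 0.15, 0.5, 0.1],
              [0.25, 0.85, 0.5, 0.9]])

# A hypothetical morphism Y -> Z with |Z| = 3, chosen only for illustration.
g = np.array([[0.2, 0.5],
              [0.3, 0.5],
              [0.5, 0.0]])

composite = g @ f          # composition: a morphism X -> Z (3 x 4)
parallel = np.kron(f, g)   # tensor: a morphism X × Y -> Y × Z (6 x 8)

print(is_stochastic(composite), is_stochastic(parallel))
```

Both results are again column-stochastic, which is what makes composition and tensor well defined in Stoch.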

slide-43
SLIDE 43

Causal theories

Recall that a random variable X induces a probability distribution on its set of possible values. We call two random variables equivalent if they induce the same distribution, and two Bayesian networks equivalent if their induced joint distributions are equivalent.

slide-44
SLIDE 44

Causal theories

Theorem. Let G be a DAG. The models of the causal theory C_G in Stoch are in one-to-one correspondence with equivalence classes of Bayesian networks on G such that the random variables corresponding to its vertices are finite and of full support.

slide-45
SLIDE 45

Outline

  • 1. Categories
  • 2. Lawvere theories
  • 3. Bayesian networks
  • 4. Causal theories
  • 5. An application to Simpson’s paradox
slide-46
SLIDE 46

Simpson’s paradox

Suppose we trial a treatment with the variables:

  T: the patient receives treatment for a heart condition.
  R: the patient has recovered at the conclusion of treatment.
  B: the patient has healthy blood pressure at the post-treatment checkup.

What should we conclude from the following data?

          t,b   ¬t,b   t,¬b   ¬t,¬b
  r        30     40      9       2
  ¬r       10     40     51      18

slide-47
SLIDE 47

Simpson’s paradox

Let’s look at their causal theories. There are two causal structures we might consider:

  (1) T → B, T → R, B → R
  (2) T → R, B → R

The morphisms in their corresponding causal theories lead us to two different conclusions.

slide-50
SLIDE 50

Simpson’s paradox

In the first case we have the causal theory generated by T, B∣T and R∣TB, with model in Stoch given by

  T = ( 0.5 )
      ( 0.5 )

  B∣T = ( 0.4  0.8 )
        ( 0.6  0.2 )

  R∣TB = ( 0.75  0.15  0.5  0.1 )
         ( 0.25  0.85  0.5  0.9 )
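These matrices can be recovered from the trial counts as empirical frequencies. The sketch below is one way to do so; the three-index array layout is an assumption, and the column order (t,b), (t,¬b), (¬t,b), (¬t,¬b) for R∣TB is inferred from matching the slide's entries against the data.

```python
import numpy as np

# Counts indexed as counts[R, T, B], with index 0 = true (r, t, b) and 1 = false.
counts = np.zeros((2, 2, 2))
counts[0, 0, 0], counts[0, 1, 0], counts[0, 0, 1], counts[0, 1, 1] = 30, 40, 9, 2
counts[1, 0, 0], counts[1, 1, 0], counts[1, 0, 1], counts[1, 1, 1] = 10, 40, 51, 18

n_tb = counts.sum(axis=0)                                # patients per (T, B) cell
T = n_tb.sum(axis=1) / counts.sum()                      # distribution of T
B_given_T = (n_tb / n_tb.sum(axis=1, keepdims=True)).T   # rows: b, ¬b; columns: t, ¬t
# Columns ordered (t,b), (t,¬b), (¬t,b), (¬t,¬b).
R_given_TB = (counts / n_tb).reshape(2, 4)

print(T, B_given_T, R_given_TB, sep="\n")
```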

slide-51
SLIDE 51

Simpson’s paradox

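This leads us to reason via the morphism P[R∣∣T]∶T → R:
[String diagram: T is copied; one copy passes through B∣T, and its output together with the other copy of T feeds into R∣TB, giving R]

  P[R∣∣T] = ( 0.39  0.42 )
            ( 0.61  0.58 )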
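The stated matrix can be reproduced by composing stochastic matrices. This is a sketch under two assumptions: that the diagram copies T, applies B∣T to one copy and then applies R∣TB, and that the columns of R∣TB are ordered (t,b), (t,¬b), (¬t,b), (¬t,¬b). The name `copy_T` is introduced here for the copying map.

```python
import numpy as np

B_given_T = np.array([[0.4, 0.8],
                      [0.6, 0.2]])
# Columns ordered (t,b), (t,¬b), (¬t,b), (¬t,¬b).
R_given_TB = np.array([[0.75, 0.15, 0.5, 0.1],
                       [0.25, 0.85, 0.5, 0.9]])

# Copy map T -> T × T as a 4 x 2 matrix: t ↦ (t,t), ¬t ↦ (¬t,¬t).
copy_T = np.array([[1, 0],
                   [0, 0],
                   [0, 0],
                   [0, 1]])

# Copy T, apply B|T to the second copy, then apply R|TB.
P = R_given_TB @ np.kron(np.eye(2), B_given_T) @ copy_T
print(P)
```

For instance, the top-left entry is 0.75 × 0.4 + 0.15 × 0.6 = 0.39.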

slide-52
SLIDE 52

Simpson’s paradox

In the second case we have the causal theory generated by T, B and R∣TB, with model in Stoch given by

  T = ( 0.5 )
      ( 0.5 )

  B = ( 0.6 )
      ( 0.4 )

  R∣TB = ( 0.75  0.15  0.5  0.1 )
         ( 0.25  0.85  0.5  0.9 )

slide-53
SLIDE 53

Simpson’s paradox

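We thus reason by the morphism Q[R∣∣T]∶T → R:
[String diagram: B is generated independently of T, and T together with B feeds into R∣TB, giving R]

  Q[R∣∣T] = ( 0.51  0.34 )
            ( 0.49  0.66 )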
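This matrix can be reproduced the same way. The sketch assumes the diagram tensors the identity on T with the distribution B and then applies R∣TB, with the columns of R∣TB ordered (t,b), (t,¬b), (¬t,b), (¬t,¬b).

```python
import numpy as np

B = np.array([[0.6],
              [0.4]])
# Columns ordered (t,b), (t,¬b), (¬t,b), (¬t,¬b).
R_given_TB = np.array([[0.75, 0.15, 0.5, 0.1],
                       [0.25, 0.85, 0.5, 0.9]])

# Tensor the identity on T with the state B (a 4 x 2 matrix), then apply R|TB.
Q = R_given_TB @ np.kron(np.eye(2), B)
print(Q)
```

Here the top-left entry is 0.75 × 0.6 + 0.15 × 0.4 = 0.51, and the top-right entry is 0.5 × 0.6 + 0.1 × 0.4 = 0.34.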

slide-54
SLIDE 54

Simpson’s paradox

Thus if the treatment acts by affecting blood pressure, it is ineffective (recovery probability 0.39 with treatment against 0.42 without). On the other hand, if it is the blood pressure affecting the efficacy of the treatment, then the treatment is effective (0.51 against 0.34).

slide-55
SLIDE 55

Thanks for listening. More details: arXiv:1301.6201. Slides will be available at http://www.cs.ox.ac.uk/people/brendan.fong/. This is part of John Baez’s networks programme; see http://math.ucr.edu/baez/networks/.