SLIDE 1

Probabilistic & Unsupervised Learning Graphical Models

Maneesh Sahani

maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept. of Computer Science, University College London. Term 1, Autumn 2017.

SLIDE 6

Graphs, independence and factorisation.

[Figure: latent chain model; states y1 → y2 → ⋯ → yT, each yt emitting an observation xt]

The (Markov) independence structure of a latent chain model implied that the joint-data likelihood factorised:

P(X, Y) = P(y1) P(x1|y1) ∏_{t=2}^{T} P(yt|y_{t−1}) P(xt|yt)

We exploited the factored form to obtain local O(T) learning algorithms.

◮ Learning: requires only local marginals of the posterior.
◮ Inference: local marginals are found by passing local messages.

The independence structure of the model (and the factorisation of its likelihood) is encoded in its graph.
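A minimal numerical sketch of this factorised joint for a discrete chain may help make the factorisation concrete. The parameters pi0 (initial distribution), A (transition matrix) and B (emission matrix), and all sizes, are hypothetical stand-ins rather than anything defined in the slides:

```python
# Hypothetical discrete latent chain (HMM): evaluate the factorised joint
# log P(X, Y) = log P(y1) + log P(x1|y1)
#             + sum_{t=2}^T [ log P(yt|y_{t-1}) + log P(xt|yt) ].
import numpy as np

rng = np.random.default_rng(0)
K, M, T = 3, 4, 10                       # states, symbols, chain length

pi0 = np.full(K, 1.0 / K)                # P(y1)
A = rng.dirichlet(np.ones(K), size=K)    # A[i, j] = P(y_t = j | y_{t-1} = i)
B = rng.dirichlet(np.ones(M), size=K)    # B[i, m] = P(x_t = m | y_t = i)

y = rng.integers(K, size=T)              # one latent path
x = rng.integers(M, size=T)              # one observation sequence

logp = np.log(pi0[y[0]]) + np.log(B[y[0], x[0]])
for t in range(1, T):
    logp += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
print(logp)                              # the T local terms are all we need
```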

SLIDE 7

Varieties of graphical model

[Figure: the same five variables A–E drawn under six formalisms: factor graph, undirected graph, directed graph, bidirected graph, chain graph, mixed graph]

◮ Nodes in the graph correspond to random variables.
◮ Edges in the graph indicate statistical dependence between the variables.
◮ (Absent edges signal (conditional) independence between variables.)

SLIDE 11

Why the graph?

◮ Gives an intuitive representation of the relationships amongst many variables, possibly embodying prior beliefs or knowledge about causal relationships. (Examples: inheritance in family trees, noise in electric circuits, neural networks.)

◮ Provides a precise syntax to describe these relationships, and to infer any implied (in)dependencies amongst larger groups of variables. Is A⊥⊥E | {B, C}?

◮ Each graphical structure corresponds to a parametric family of distributions that satisfy all the implied (in)dependencies.

◮ Graph-based manipulations allow us to identify the sufficient statistics of these distributions needed for learning, and to construct general-purpose message-passing algorithms that implement inference efficiently. Find P(A|C = c) without enumerating all settings of B, D, E . . .

SLIDE 12

Types of independence

For events or random variables X, Y, V:

Conditional independence:
X⊥⊥Y | V ⇔ P(X|Y, V) = P(X|V)   [provided, for events, P(Y, V) > 0]

Thus,
X⊥⊥Y | V ⇔ P(X, Y|V) = P(X|Y, V) P(Y|V) = P(X|V) P(Y|V)

We can generalise to conditional independence between sets of random variables:
𝒳⊥⊥𝒴 | 𝒱 ⇔ {X⊥⊥Y | 𝒱, ∀X ∈ 𝒳 and ∀Y ∈ 𝒴}

Marginal independence:
X⊥⊥Y ⇔ X⊥⊥Y | ∅ ⇔ P(X, Y) = P(X)P(Y)
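The definition is easy to verify numerically. The sketch below builds a hypothetical joint table over binary X, Y, V that satisfies X⊥⊥Y | V by construction, and checks the product form; all numbers are purely illustrative:

```python
# Check X ⊥⊥ Y | V by testing P(X, Y | V) = P(X | V) P(Y | V) for each v.
import numpy as np

Pv = np.array([0.3, 0.7])                   # P(V)
Px_v = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(X | V), rows indexed by v
Py_v = np.array([[0.5, 0.5], [0.6, 0.4]])   # P(Y | V), rows indexed by v

# Joint built to factorise: P[x, y, v] = P(x|v) P(y|v) P(v)
P = np.einsum('vx,vy,v->xyv', Px_v, Py_v, Pv)

for v in range(2):
    Pxy_v = P[:, :, v] / P[:, :, v].sum()          # P(X, Y | V = v)
    outer = Pxy_v.sum(1)[:, None] * Pxy_v.sum(0)   # P(X|v) P(Y|v)
    assert np.allclose(Pxy_v, outer)
print("X ⊥⊥ Y | V holds for this joint")
```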

SLIDE 13

Factor graphs

[Figure: two factor graphs (a) and (b) over variables A–E]

A factor graph is a direct graphical representation of the factorised model structure: each square indicates a factor that depends on the linked variables.

P(X) = (1/Z) ∏_j f_j(X_{C_j})

where X = {X1, . . . , XK}, X_S = {X_i : i ∈ S}, j indexes the factors, C_j contains the indices of variables adjacent to factor j, f_j is the factor function (also called the factor potential or clique potential), and Z is a normalisation constant.

SLIDE 14

Factor graphs: examples

[Figure: factor graphs (a) and (b) over variables A–E]

Examples:

(a) P(A, B, C, D, E) = (1/Z_a) f1(A, C) f2(B, C, D) f3(C, D, E)

(b) P(A, B, C, D, E) = (1/Z_b) f1(A, C) f2(B, C) f3(C, D) f4(B, D) f5(C, E) f6(D, E)

and [e.g.]:

Z_a = ∑_{a∈𝒜} ∑_{b∈ℬ} ∑_{c∈𝒞} ∑_{d∈𝒟} ∑_{e∈ℰ} f1(a, c) f2(b, c, d) f3(c, d, e)

where 𝒜, ℬ, 𝒞, 𝒟 and ℰ are the domains of the corresponding random variables.
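Because Z_a is just a sum over all joint settings, it can be checked by brute force on small examples. A sketch for example (a), with binary variables and hypothetical random potential tables:

```python
# Brute-force normaliser for P ∝ f1(A,C) f2(B,C,D) f3(C,D,E).
import itertools
import numpy as np

rng = np.random.default_rng(1)
f1 = rng.random((2, 2))        # f1(a, c)
f2 = rng.random((2, 2, 2))     # f2(b, c, d)
f3 = rng.random((2, 2, 2))     # f3(c, d, e)

# Explicit five-fold sum over the variable domains:
Za = sum(f1[a, c] * f2[b, c, d] * f3[c, d, e]
         for a, b, c, d, e in itertools.product(range(2), repeat=5))

# The same sum, vectorised:
assert np.isclose(Za, np.einsum('ac,bcd,cde->', f1, f2, f3))
print(Za)
```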

SLIDE 18

Factor graphs: conditional independence

[Figure: factor graphs (a) and (b) over variables A–E]

Conditional independence: X⊥⊥Y | 𝒱 if every path between X and Y contains some V ∈ 𝒱.

In both graphs:

A⊥⊥D | C
B⊥̸⊥E | C   (the path B–D–E avoids C)
B⊥⊥E | {C, D}
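This path-blocking semantics is mechanical to test: X⊥⊥Y | 𝒱 holds exactly when deleting the nodes in 𝒱 disconnects X from Y. A sketch using the adjacency structure of graph (a) (two variables are adjacent iff they share a factor); the helper name separated is our own:

```python
# Undirected/factor-graph conditional independence by graph separation.
from collections import deque

# Graph (a): neighbours induced by f1(A,C), f2(B,C,D), f3(C,D,E).
adj = {'A': {'C'}, 'B': {'C', 'D'}, 'C': {'A', 'B', 'D', 'E'},
       'D': {'B', 'C', 'E'}, 'E': {'C', 'D'}}

def separated(x, y, cond, adj):
    """True iff every path from x to y passes through a node in cond."""
    seen, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt in cond or nxt in seen:
                continue
            if nxt == y:
                return False
            seen.add(nxt)
            queue.append(nxt)
    return True

print(separated('A', 'D', {'C'}, adj))        # True:  A ⊥⊥ D | C
print(separated('B', 'E', {'C'}, adj))        # False: path B-D-E avoids C
print(separated('B', 'E', {'C', 'D'}, adj))   # True:  B ⊥⊥ E | {C, D}
```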

SLIDE 22

Factorisation and conditional independence

[Figure: factor graph over A–E, with factors gB and gE marked]

Every path between X and Y contains some V ∈ 𝒱 ⇒ there exists a factorisation:

P(X, Y, 𝒱, . . . ) = (1/Z) g_X(X, V_X, Q_X) g_Y(Y, V_Y, Q_Y) g_R(Q_R, V_R)

where V_X, V_Y, V_R ⊆ 𝒱 and the sets of remaining variables Q_X, Q_Y and Q_R are disjoint.

⇒ P(X|Y, 𝒱, . . . ) = P(X, Y, 𝒱, . . . ) / P(Y, 𝒱, . . . )

  = [ (1/Z) g_X(X, V_X, Q_X) g_Y(Y, V_Y, Q_Y) g_R(Q_R, V_R) ] / [ ∑_{X′} (1/Z) g_X(X′, V_X, Q_X) g_Y(Y, V_Y, Q_Y) g_R(Q_R, V_R) ]

  = g_X(X, V_X, Q_X) / ∑_{X′} g_X(X′, V_X, Q_X)

Since the RHS does not depend on Y, it follows that X⊥⊥Y | 𝒱.

SLIDE 25

Factor graphs: neighbourhoods and Markov boundaries

[Figure: factor graphs (a) and (b) over variables A–E]

◮ Variables are neighbours if they share a common factor; the neighbourhood ne(X) is the set of all neighbours of X.

◮ Each variable X is conditionally independent of all non-neighbours given its neighbours:
X⊥⊥Y | ne(X), ∀Y ∉ {X ∪ ne(X)}
⇒ ne(X) is a Markov blanket for X.

◮ In fact, the neighbourhood is the minimal such set: the Markov boundary.

SLIDE 28

Undirected graphical models: Markov networks

[Figure: undirected graph over variables A–E]

An undirected graphical model is a direct representation of conditional independence structure. Nodes are connected iff they are conditionally dependent given all others.

⇒ neighbours (connected nodes) in a Markov net share a factor.
⇒ non-neighbours (disconnected nodes) in a Markov net cannot share a factor.
⇒ the joint probability factors over the maximal cliques C_j of the graph:

P(X) = (1/Z) ∏_j f_j(X_{C_j})

It may also factor more finely (as we will see in a moment).

[Cliques are fully connected subgraphs; maximal cliques are cliques not contained in other cliques.]

SLIDE 29

Undirected graphs: Markov boundaries

[Figure: undirected graph over variables A–E]

◮ X⊥⊥Y | 𝒱 if every path between X and Y contains some node V ∈ 𝒱.

◮ Each variable X is conditionally independent of all non-neighbours given its neighbours: X⊥⊥Y | ne(X), ∀Y ∉ {X ∪ ne(X)}.

◮ 𝒱 is a Markov blanket for X iff X⊥⊥Y | 𝒱 for all Y ∉ {X ∪ 𝒱}.

◮ Markov boundary: the minimal Markov blanket. For undirected graphs (like factor graphs) this is the set of neighbours of X.

SLIDE 30

Undirected graphs and factor graphs

[Figure: one undirected graph (a) and two factor graphs (b), (c) over variables A–E]

◮ Each node has the same neighbours in each graph, so (a), (b) and (c) represent exactly the same conditional independence relationships.

◮ The implied maximal factorisations differ: (b) has two three-way factors; (c) has only pairwise factors; (a) cannot distinguish between these (so we have to adopt factorisation (b) to be safe).

◮ Suppose all variables are discrete and can take on K possible values. Then the functions in (a) and (b) are tables with O(K³) cells, whereas in (c) they are O(K²).

◮ Factor graphs have richer expressive power than undirected graphical models.
◮ Factors cannot be determined solely by testing for conditional independence.

SLIDE 31

Some examples of undirected graphical models

◮ Markov random fields (used in computer vision).
◮ Maximum entropy language models (used in speech and language modelling):

P(X) = (1/Z) p0(X) exp( ∑_j λ_j g_j(X) )

◮ Conditional random fields are undirected graphical models (conditioned on the input variables).
◮ Boltzmann machines (a kind of neural network/Ising model).

SLIDE 36

Limitations of undirected and factor graphs

Undirected and factor graphs fail to capture some useful independencies: a pair of variables may be connected merely because some other variable depends on them. The classic example (due to Pearl):

[Figure: DAG Rain → Ground wet ← Sprinkler, and the corresponding undirected graph with Rain and Sprinkler both linked to Ground wet]

◮ Most sprinklers switch on come rain or shine; and certainly the weather pays no heed to the state of the sprinklers.

◮ Explaining away: Damp ground suggests that it has rained; but if we also see a running sprinkler this explains away the damp, returning our belief about rain to the prior.

◮ R⊥⊥S | ∅, but R⊥̸⊥S | G.

This highlights the difference between marginal and conditional independence.

SLIDE 37

Directed acyclic graphical models

[Figure: DAG over variables A–E]

A directed acyclic graphical (DAG) model represents a factorisation of the joint probability distribution in terms of conditionals:

P(A, B, C, D, E) = P(A) P(B) P(C|A, B) P(D|B, C) P(E|C, D)

In general:

P(X1, . . . , Xn) = ∏_{i=1}^{n} P(X_i | X_{pa(i)})

where pa(i) are the parents of node i. DAG models are also known as Bayesian networks or Bayes nets.
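The factorisation can be exercised directly with tables. A sketch with binary variables and hypothetical CPTs for the example on this slide:

```python
# Build P(A,B,C,D,E) = P(A)P(B)P(C|A,B)P(D|B,C)P(E|C,D) from random tables.
import numpy as np

rng = np.random.default_rng(2)

def cpt(*shape):
    """Random conditional table, normalised over its last axis (the child)."""
    t = rng.random(shape)
    return t / t.sum(-1, keepdims=True)

pA, pB = cpt(2), cpt(2)
pC_AB, pD_BC, pE_CD = cpt(2, 2, 2), cpt(2, 2, 2), cpt(2, 2, 2)

joint = np.einsum('a,b,abc,bcd,cde->abcde', pA, pB, pC_AB, pD_BC, pE_CD)
assert np.isclose(joint.sum(), 1.0)          # a DAG needs no global Z
print(joint[1, 0, 1, 1, 0])                  # P(A=1, B=0, C=1, D=1, E=0)
```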

SLIDE 44

Conditional independence in DAGs

[Figure: DAG with A, B parents of C; B, C parents of D; C, D parents of E]

Reading conditional independence from DAGs is more complicated than in undirected graphs.

◮ A⊥⊥E | {B, C}: conditioning nodes block paths.
◮ A⊥⊥B | ∅: other nodes block reflected paths.
◮ A⊥̸⊥B | C: a conditioning node creates a reflected path by explaining away.
◮ A⊥̸⊥E | C: the created path extends to E via D.
◮ A⊥⊥E | {C, D}: but it is blocked by observing D.

So conditioning on (i.e. observing) nodes can both create and remove dependencies.

SLIDE 49

The Bayes-ball algorithm

[Figure: DAG over variables A–E]

Game: can you get a ball from X to Y without being blocked by 𝒱? If so, X⊥̸⊥Y | 𝒱.

Rules: balls follow edges, and are passed on or bounced back from nodes according to:

◮ Nodes V ∉ 𝒱 pass balls down or up chains: →V→ or ←V←.
◮ Nodes V ∉ 𝒱 bounce balls from children to children.
◮ Nodes V ∈ 𝒱 bounce balls from parents to parents (including returning the ball whence it came).

Otherwise the ball is blocked. (So V ∈ 𝒱 blocks all balls from children, and stops balls from parents reaching children.)

SLIDE 56

D-separation

[Figure: DAG over variables A–E]

So when is X⊥⊥Y | 𝒱?

Consider every undirected path (i.e. ignoring arrows) between X and Y. The path is blocked by 𝒱 if there is a node V on the path such that either:

◮ V has convergent arrows (→V←) on the path (i.e., V is a “collider node”), and neither V nor its descendants are in 𝒱; or

◮ V does not have convergent arrows on the path (→V→ or ←V→), and V ∈ 𝒱. This is similar to the undirected graph semantics.

If all paths are blocked, we say 𝒱 d-separates X from Y (d for directed), and X⊥⊥Y | 𝒱.

Markov boundary for X: {pa(X) ∪ ch(X) ∪ pa(ch(X))}.
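These rules translate directly into a reachability test (essentially Bayes-ball; this is the standard algorithm as given, e.g., by Koller and Friedman). A sketch for the example DAG; the function name d_separated and the dict-based graph encoding are our own choices:

```python
# d-separation testing by reachability (Bayes-ball style).
from collections import deque

parents = {'A': set(), 'B': set(), 'C': {'A', 'B'},
           'D': {'B', 'C'}, 'E': {'C', 'D'}}

def d_separated(x, y, z, parents):
    """True iff x is d-separated from y by the set z in the DAG."""
    children = {n: set() for n in parents}
    for c, ps in parents.items():
        for p in ps:
            children[p].add(c)
    # Nodes in z, or with a descendant in z (i.e. ancestors of z):
    anc, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])
    # Breadth-first search over (node, direction-of-arrival) pairs.
    visited, queue = set(), deque([(x, 'up')])
    while queue:
        n, d = queue.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n == y:
            return False                    # an active path reached y
        if d == 'up' and n not in z:        # arrived from a child
            for p in parents[n]:
                queue.append((p, 'up'))
            for c in children[n]:
                queue.append((c, 'down'))
        elif d == 'down':                   # arrived from a parent
            if n not in z:
                for c in children[n]:
                    queue.append((c, 'down'))
            if n in anc:                    # collider made active by z
                for p in parents[n]:
                    queue.append((p, 'up'))
    return True

print(d_separated('A', 'E', {'B', 'C'}, parents))  # True
print(d_separated('A', 'B', set(), parents))       # True
print(d_separated('A', 'B', {'C'}, parents))       # False (explaining away)
print(d_separated('A', 'E', {'C'}, parents))       # False (path via D)
print(d_separated('A', 'E', {'C', 'D'}, parents))  # True
```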

SLIDE 57

Expressive power of directed and undirected graphs

[Figure: an undirected four-cycle on A, B, D, E]

No DAG can represent these and only these independencies. No matter how we direct the arrows there will always be two non-adjacent parents sharing a common child ⇒ dependence in the DAG but independence in the undirected graph.

[Figure: DAG A → C ← B]

No undirected or factor graph can represent these and only these independencies. One three-way factor, but this does not encode marginal independence.

SLIDE 63

Graphs, conditional independencies, and families of distributions

Each graph G implies a set of conditional independence statements C(G) = {X_i⊥⊥Y_i | 𝒱_i}.

Each such set C defines a family of distributions that satisfy all the statements in C:

P_{C(G)} = {P(X) : P(X_i, Y_i | 𝒱_i) = P(X_i | 𝒱_i) P(Y_i | 𝒱_i) for all X_i⊥⊥Y_i | 𝒱_i in C}

G may also encode a family of distributions by their functional form, e.g. for a factor graph:

P_G = {P(X) : P(X) = (1/Z) ∏_j f_j(X_{C_j}), for some non-negative functions f_j}

◮ For directed graphs, P_G = P_{C(G)}.
◮ For undirected graphs, P_G = P_{C(G)} if all distributions are positive, i.e. P(X) > 0 for all values of X (Hammersley-Clifford theorem).
◮ There are factor graphs for which P_G ≠ P_{C(G)}.
◮ Factor graphs are more expressive than undirected graphs: for every undirected graph G1 there is a factor graph G2 with P_{G1} = P_{G2}, but not vice versa.
◮ Adding edges to a graph ⇒ removing conditional independence statements ⇒ enlarging the family of distributions (the converse holds for removing edges).

SLIDE 64

Graphs, conditional independencies, and families of distributions

[Figure: diagram relating the family of factored distributions {p(X) = ∏_i p(X_i | X_{pa(i)})} to the family {p(X) : p(X_i, Y_i | 𝒱_i) = p(X_i | 𝒱_i) p(Y_i | 𝒱_i)} defined by the conditional independence statements {X_i⊥⊥Y_i | 𝒱_i}]

SLIDE 65

Tree-structured graphical models

[Figure: four trees over variables A–G: a rooted directed tree, a directed polytree, an undirected tree, and a tree-structured factor graph]

These are all tree-structured or “singly-connected” graphs.

SLIDE 66

Polytrees to tree-structured factor graphs

[Figure: a directed polytree over A–G and its factor graph]

Polytrees are tree-structured DAGs that may have more than one root.

P(X) = ∏_i P(X_i | X_{pa(i)}) = ∏_i f_i(X_{C_i})

where C_i = {i} ∪ pa(i) and f_i(X_{C_i}) = P(X_i | X_{pa(i)}). The marginal distribution on each root, P(X_r), is absorbed into an adjacent factor.

SLIDE 67

Undirected trees and factor graphs

[Figure: an undirected tree over A–G and its factor graph]

In an undirected tree all maximal cliques are of size 2, and so the equivalent factor graph has only pairwise factors.

P(X) = (1/Z) ∏_{edges (ij)} f_{(ij)}(X_i, X_j)

SLIDE 68

Rooted directed trees to undirected trees

[Figure: a rooted directed tree over A–G and the corresponding undirected tree]

The distribution for a single-rooted directed tree can be written as a product of pairwise factors ⇒ an undirected tree.

P(X) = P(X_r) ∏_{i≠r} P(X_i | X_{pa(i)}) = ∏_{edges (ij)} f_{(ij)}(X_i, X_j)

SLIDE 73

Undirected trees to rooted directed trees

[Figure: an undirected tree over A–G and the rooted directed tree obtained from it]

This direction is slightly trickier:

◮ Choose an arbitrary node X_r to be the root and point all the arrows away from it.
◮ Compute the marginal distributions on single nodes P(X_i) and on edges P(X_i, X_j) implied by the undirected graph.
◮ Compute the conditionals in the DAG:

P(X) = P(X_r) ∏_{i≠r} P(X_i | X_{pa(i)}) = P(X_r) ∏_{i≠r} P(X_i, X_{pa(i)}) / P(X_{pa(i)}) = ∏_{edges (ij)} P(X_i, X_j) / ∏_{nodes i} P(X_i)^{deg(i)−1}

How do we compute P(X_i) and P(X_i, X_j)? ⇒ Belief propagation.

SLIDE 79

Finding marginals in undirected trees

[Figure: node X_i with neighbour X_j; X_j roots the subtree T_{j→i}]

Undirected tree ⇒ pairwise factored joint distribution:

P(X) = (1/Z) ∏_{(ij)∈E_T} f_{(ij)}(X_i, X_j)

Each neighbour X_j of X_i defines a disjoint subtree T_{j→i}. So we can split up the product:

P(X_i) = ∑_{X∖{X_i}} P(X) ∝ ∑_{X∖{X_i}} ∏_{(ij)∈E_T} f_{(ij)}(X_i, X_j)

  = ∑_{X∖{X_i}} ∏_{X_j∈ne(X_i)} [ f_{(ij)}(X_i, X_j) ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′}) ]

  = ∏_{X_j∈ne(X_i)} [ ∑_{X_{T_{j→i}}} f_{(ij)}(X_i, X_j) ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′}) ]

  = ∏_{X_j∈ne(X_i)} M_{j→i}(X_i)

where the bracketed term defines the message M_{j→i}(X_i).

SLIDE 85

Message recursion: Belief Propagation (BP)

[Figure: message M_{j→i}(X_i) passed from X_j to X_i along the tree]

M_{j→i}(X_i) = ∑_{X_{T_{j→i}}} f_{(ij)}(X_i, X_j) ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′})

  = ∑_{X_j} f_{(ij)}(X_i, X_j) [ ∑_{X_{T_{j→i}}∖X_j} ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′}) ]

where the bracketed term ∝ P_{T_{j→i}}(X_j) ∝ ∏_{X_k∈ne(X_j)∖X_i} M_{k→j}(X_j), so

M_{j→i}(X_i) = ∑_{X_j} f_{(ij)}(X_i, X_j) ∏_{X_k∈ne(X_j)∖X_i} M_{k→j}(X_j)
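The recursion is only a few lines of code. A sketch of BP for node marginals on a small discrete tree; the tree shape, the number of states K and the random pairwise potentials are all hypothetical, and the result is checked against brute-force enumeration:

```python
# Belief propagation for node marginals on an undirected tree.
import numpy as np

rng = np.random.default_rng(3)
K = 3                                        # states per variable
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]     # a small tree
f = {e: rng.random((K, K)) for e in edges}   # f[(i, j)][x_i, x_j]

ne = {}
for i, j in edges:
    ne.setdefault(i, set()).add(j)
    ne.setdefault(j, set()).add(i)

def pot(i, j):
    """Potential on edge {i, j}, as a table indexed [x_i, x_j]."""
    return f[(i, j)] if (i, j) in f else f[(j, i)].T

M = {}
def message(j, i):
    """M_{j->i}(x_i): sum over x_j of f_ij times incoming messages to j."""
    if (j, i) not in M:
        incoming = np.ones(K)
        for k in ne[j] - {i}:                # all neighbours of j except i
            incoming *= message(k, j)
        M[(j, i)] = pot(i, j) @ incoming     # marginalise out x_j
    return M[(j, i)]

def marginal(i):
    p = np.ones(K)
    for j in ne[i]:
        p *= message(j, i)
    return p / p.sum()

# Sanity check against brute-force enumeration of the joint:
joint = np.einsum('ab,bc,bd,de->abcde', *(pot(i, j) for i, j in edges))
brute = joint.sum(axis=(1, 2, 3, 4))
assert np.allclose(marginal(0), brute / brute.sum())
print(marginal(0))
```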

SLIDE 89

BP for pairwise marginals in undirected trees

[Figure: adjacent nodes X_i and X_j in the tree]

P(X_i, X_j) = ∑_{X∖{X_i,X_j}} P(X) ∝ ∑_{X∖{X_i,X_j}} ∏_{(ij)∈E_T} f_{(ij)}(X_i, X_j)

  = ∑_{X∖{X_i,X_j}} f_{(ij)}(X_i, X_j) ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′}) ∏_{(i′j′)∈E_{T_{i→j}}} f_{(i′j′)}(X_{i′}, X_{j′})

  = f_{(ij)}(X_i, X_j) [ ∑_{X_{T_{j→i}}∖X_j} ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′}) ] [ ∑_{X_{T_{i→j}}∖X_i} ∏_{(i′j′)∈E_{T_{i→j}}} f_{(i′j′)}(X_{i′}, X_{j′}) ]

  = f_{(ij)}(X_i, X_j) ∏_{X_k∈ne(X_j)∖X_i} M_{k→j}(X_j) ∏_{X_k∈ne(X_i)∖X_j} M_{k→i}(X_i)

SLIDE 94

BP for inference

[Figure: leaf X_a attached to X_i; an observed internal node X_b with neighbours X_j and X_k]

Messages from observed leaf nodes are conditioned rather than marginalised:

To compute P(X_i):            M_{a→i}(X_i) = ∑_{X_a} f_{ai}(X_a, X_i)
To compute P(X_i | X_a = a):  M_{a→i}(X_i) = f_{ai}(X_a = a, X_i)

Observed internal nodes partition the graph, and so messages propagate independently:

M_{b→j}(X_j) = f_{bj}(X_b = b, X_j)        M_{b→k}(X_k) = f_{bk}(X_b = b, X_k)

Messages M_{i→j} are proportional to the likelihood based on any observed variables (O) within the message’s subtree T_{i→j}, possibly scaled by a prior factor (depending on the factorisation):

M_{i→j}(X_j) ∝ P(X_{T_{i→j}} ∩ O | X_j) P(X_j)

SLIDE 95

BP for latent chain models

[Figure: chain s1 → s2 → ⋯ → sT with emissions x1, …, xT]

A latent chain model is a rooted directed tree ⇒ an undirected tree. The forward-backward algorithm is just BP on this graph:

α_t(i) ⇔ M_{s_{t−1}→s_t}(s_t = i) ∝ P(x_{1:t}, s_t = i)
β_t(i) ⇔ M_{s_{t+1}→s_t}(s_t = i) ∝ P(x_{t+1:T} | s_t = i)
α_t(i) β_t(i) = ∏_{j∈ne(s_t)} M_{j→s_t}(s_t = i) ∝ P(s_t = i | O)

Algorithms like BP extend the power of graphical models beyond just the encoding of independence and factorisation. A single derivation serves for a wide array of models.
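For concreteness, a minimal forward-backward sketch for a discrete HMM; the parameters pi0, A, B and the observation sequence are hypothetical stand-ins:

```python
# Forward-backward as BP on the chain.
import numpy as np

rng = np.random.default_rng(4)
K, M, T = 3, 4, 6
pi0 = np.full(K, 1 / K)                    # P(s1)
A = rng.dirichlet(np.ones(K), size=K)      # A[i, j] = P(s_t = j | s_{t-1} = i)
B = rng.dirichlet(np.ones(M), size=K)      # B[i, m] = P(x_t = m | s_t = i)
x = rng.integers(M, size=T)

alpha = np.zeros((T, K))                   # alpha[t, i] ~ P(x_{1:t}, s_t = i)
beta = np.ones((T, K))                     # beta[t, i]  ~ P(x_{t+1:T} | s_t = i)
alpha[0] = pi0 * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]     # forward message
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])   # backward message

post = alpha * beta                        # product of messages into s_t
post /= post.sum(1, keepdims=True)         # P(s_t = i | x_{1:T})
print(post)
```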

SLIDE 102

BP in non-trees?

[Figure: a loopy graph over A–E]

Can we find P(D) easily?

◮ Neighbours do not belong to disjoint subtrees, so the influence of other nodes cannot be separated into messages.
◮ Observed nodes may break loops and make subtrees independent, but may not resolve all loops.

Possible strategies:

◮ Propagate local messages anyway, and hope for the best:
  ◮ loopy belief propagation, actually an approximation, which we will study later.
◮ Group variables together into multivariate nodes until the resulting graph is a tree:
  ◮ the junction tree.

SLIDE 109

Graph transformations

For exact inference in arbitrary graphical models we need to transform the given graph into one that will be easier to handle (specifically a tree: the junction or join tree).

The original graph G encoded a distribution P(X) with a certain factorisation or independence structure.

◮ Transformation from G to an easy-to-handle G′ will only be valid if P(X) can also be represented by G′.
◮ This can be ensured if every step of the graph transformation only removes conditional independencies, never adds them.
◮ Thus the family of possible encoded distributions grows or stays the same at each step, and P(X) will be in the family of distributions represented by G′.
◮ The factor potentials on the new graph G′ are built from those given on G, so as to encode the same distribution.
◮ Then inference on G′ with the appropriate potentials acts on P(X).

SLIDE 110

The Junction Tree algorithm

[Figure: the transformation pipeline on the example graph over A–E: directed acyclic graph → factor graph → undirected graph → chordal (triangulated) undirected graph → junction tree → message passing]

SLIDE 111

DAG to factor graph

[Figure: the example DAG over A–E and its factor graph]

Factors are simply the conditional distributions in the DAG:

P(X) = ∏_i P(X_i | X_{pa(i)}) = ∏_i f_i(X_{C_i})

where C_i = {i} ∪ pa(i) and f_i(X_{C_i}) = P(X_i | X_{pa(i)}). The marginal distribution on the root(s), P(X_r), is absorbed into an adjacent factor.

SLIDE 112

Observations in a factor graph

[Figure: factor graph over A–E with observed nodes marked]

Inference usually targets a posterior marginal given a set of observed values, P(X_I | O), e.g. P(A | D = wet, C = rain). Formally, we can either modify the factors linked to observed nodes, or add singleton factors adjacent to the observed nodes, e.g.

fD(D) = 1 if D = wet; 0 otherwise.
fC(C) = 1 if C = rain; 0 otherwise.
slide-113
SLIDE 113

Factor graph to undirected graph

C A B D E C A B D E

The next step (triangulation) will depend on an undirected graph. Every factor from the DAG must be contained within a maximal clique of the undirected graph.

◮ Replace each factor by an undirected clique (i.e. place edge between every pair of

nodes in the factor).

◮ Construct the potentials on each maximal clique by multiplying together factor potentials

that fall within it; ensuring each factor potential only appears once. The transformation from DAG ⇒ undirected graph is called moralization:

◮ “marry” all parents of each node by adding an edge to connect them ◮ drop arrows on all the edges

SLIDE 114

Triangulating the undirected graph

[Figure: the moralised graph over A–E and its triangulation]

The undirected graph may have loops, which would interfere with belief propagation.

◮ We could join loops into cliques, but this is inefficient.
◮ Triangulation: add edges to the graph so that every loop of size ≥ 4 has at least one chord.
◮ Recursive: new edges may create new loops; ensure new loops of size ≥ 4 have chords too.
◮ An undirected graph in which every loop of size ≥ 4 has at least one chord is called chordal or triangulated.
◮ Adding edges always removes conditional independencies, enlarging the family of distributions that the graph can encode.
◮ There are many ways to add chords; in general, finding the best triangulation is NP-complete.
◮ One approach: variable elimination.

SLIDE 115

Variable elimination

Imagine marginalising the distribution one variable at a time (eliminating each from the graph). Let the order of elimination be X_{σ(1)}, X_{σ(2)}, . . . , X_{σ(n)}:

P(X_{σ(n)}) = ∑_{X_{σ(n−1)}} · · · ∑_{X_{σ(1)}} P(X)

  = (1/Z) ∑_{X_{σ(n−1)}} · · · ∑_{X_{σ(2)}} ∑_{X_{σ(1)}} ∏_i f_i(X_{C_i})

  = (1/Z) ∑_{X_{σ(n−1)}} · · · ∑_{X_{σ(2)}} [ ∏_{j : C_j ∌ σ(1)} f_j(X_{C_j}) ] ∑_{X_{σ(1)}} ∏_{i : C_i ∋ σ(1)} f_i(X_{C_i})

  = (1/Z) ∑_{X_{σ(n−1)}} · · · ∑_{X_{σ(2)}} [ ∏_{j : C_j ∌ σ(1)} f_j(X_{C_j}) ] f_new(X_{C_new})

where C_new = ne(X_{σ(1)}), and edges are added to the graph connecting all nodes in C_new.

SLIDE 116

Variable elimination

Theorem: a graph including all edges that would be induced by elimination is chordal.

Finding a good triangulation depends on finding a good order of elimination σ(1), . . . , σ(n). This is also NP-complete. Heuristics: pick the next variable to eliminate by

◮ Minimum deficiency search: choose the variable that induces the fewest new edges.
◮ Maximum cardinality search: choose the variable with the most previously visited neighbours.

Minimum deficiency search seems (empirically) to be better.
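A sketch of triangulation by elimination with the minimum deficiency heuristic: repeatedly eliminate the node whose remaining neighbours need the fewest new edges, adding those fill-in edges as we go. The example graph and all helper names are our own:

```python
# Minimum-deficiency elimination ordering with fill-in edges.
import itertools

def triangulate(adj):
    """adj: {node: set of neighbours}. Returns (elimination order, fill-ins)."""
    adj = {v: set(ns) for v, ns in adj.items()}     # work on a copy
    order, fill = [], []
    remaining = set(adj)
    while remaining:
        def deficiency(v):
            ns = adj[v] & remaining
            return sum(1 for a, b in itertools.combinations(ns, 2)
                       if b not in adj[a])
        v = min(remaining, key=deficiency)          # fewest induced edges
        ns = adj[v] & remaining
        for a, b in itertools.combinations(ns, 2):  # connect v's neighbours
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                fill.append((a, b))
        order.append(v)
        remaining.remove(v)
    return order, fill

# A four-cycle A-B-C-D needs exactly one chord:
adj = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'C', 'A'}}
print(triangulate(adj))    # one fill-in edge, e.g. ('B', 'D')
```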

SLIDE 120

Triangulation may not be obvious by inspection

[Figure: a large loopy undirected graph]

Is this graph triangulated? No. Chords must be direct connections; they cannot step through an intermediate node.

Detecting unchorded loops by inspection rapidly becomes difficult in large graphs, necessitating automated algorithms such as variable elimination.

SLIDE 121

Chordal graph to the junction tree

[Figure: chordal graph over A–E and its junction tree with cliques ABE, BCE, CDE and separators BE, CE]

A junction tree (or join tree) is a tree whose nodes and edges are labelled with sets of variables. Each node represents a clique; edges are labelled by the intersections of cliques, called separators.

◮ Cliques contain all adjacent separators.
◮ Running intersection property: if two cliques contain variable X, all cliques and separators on the path between the two cliques contain X.

The running intersection property is required for consistency.

SLIDE 122

Chordal graph to the junction tree

[Figure: the chordal graph over A–E; the weighted clique graph on ABE, BCE, CDE with separators BE, CE, E; and its maximum-weight spanning tree]

◮ Find the maximal cliques C1, . . . , Ck of the chordal undirected graph (each clique consists of an eliminated variable and its neighbours, so finding maximal cliques is easy).
◮ Construct a weighted graph, with nodes labelled by the maximal cliques and edges connecting each pair of cliques that share variables (labelled by the variables in the intersection).
◮ Define the weight of an edge to be the size of the separator.
◮ Find the maximum-weight spanning tree of the weighted graph (see the sketch below).
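The last two bullets amount to Kruskal's algorithm run with edges sorted by decreasing separator size. A sketch on the cliques of the running example; all helper names are ours:

```python
# Junction tree as a maximum-weight spanning tree over the maximal cliques.
import itertools

cliques = [frozenset('ABE'), frozenset('BCE'), frozenset('CDE')]

# Candidate edges between cliques that share variables, weighted by |separator|:
cand = [(len(ci & cj), i, j)
        for (i, ci), (j, cj) in itertools.combinations(enumerate(cliques), 2)
        if ci & cj]

parent = list(range(len(cliques)))     # union-find for Kruskal's algorithm
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

tree = []
for w, i, j in sorted(cand, reverse=True):     # heaviest separators first
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((cliques[i], cliques[j], cliques[i] & cliques[j]))

for ci, cj, sep in tree:
    print(sorted(ci), '--', sorted(sep), '--', sorted(cj))
```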

SLIDE 123

Messages on the junction tree

[Figure: the example graph over A–E and its junction tree ABE – BE – BCE – CE – CDE]

We’ve now completed the transformation from a general model to a tree-structured graph.

◮ Belief propagation on the junction tree should allow us to efficiently compute posterior marginals for inference and learning.

SLIDE 124

Recall: BP on undirected trees

M_{j→i}(X_i) = ∑_{X_{T_{j→i}}} f_{(ij)}(X_i, X_j) ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′})

  = ∑_{X_j} f_{(ij)}(X_i, X_j) [ ∑_{X_{T_{j→i}}∖X_j} ∏_{(i′j′)∈E_{T_{j→i}}} f_{(i′j′)}(X_{i′}, X_{j′}) ]

where the bracketed term ∝ P_{T_{j→i}}(X_j) ∝ ∏_{X_k∈ne(X_j)∖X_i} M_{k→j}(X_j), so

M_{j→i}(X_i) = ∑_{X_j} f_{(ij)}(X_i, X_j) ∏_{X_k∈ne(X_j)∖X_i} M_{k→j}(X_j)

SLIDE 129

Message passing on junction trees

[Figure: junction-tree segment ABC – BC – BCD, with copy variables on each side of the separator]

Maximal cliques in the chordal graph are nodes of the junction tree. Thus, the joint distribution factors over the JT nodes:

P(X) = (1/Z) ∏_i f_i(X_{C_i}) = . . . f_{ABC}(A, B, C) f_{BCD}(B, C, D) . . .

This appears to violate the usual undirected tree semantics of a factor per edge. However, the appearance of the same variables in multiple nodes introduces dependencies:

◮ Introduce copy variables on each side of the separator.
◮ Factors on nodes then no longer overlap:
  P(X) = . . . f_{ABC}(A, B(1), C(1)) f_{BCD}(B(2), C(2), D) . . .
◮ New delta-function factors on the separators enforce consistency amongst the copies:
  P(X) = . . . f_{ABC}(A, B(1), C(1)) δ(B(1) − B(2)) δ(C(1) − C(2)) f_{BCD}(B(2), C(2), D) . . .
  where the delta functions are collected into a separator factor f_sep(B(1), C(1), B(2), C(2)).

SLIDE 136

Message passing on junction trees

[Figure: clique C_i with neighbouring cliques C_j, C_k, C_l and separators S_ki, S_li, S_ij; each separator carries copy variables X^(1), X^(2) on its two sides]

We can use this view to define BP messages on the junction tree:

◮ Copy and partition the clique variables X_{C_i}:
  ◮ Unshared variables: X^(−)_{C_i} = X_{C_i} ∖ ∪_k S_{ik}.
  ◮ Variables in incoming separators: X^(2)_{S_{ki}} (matching variables X^(1)_{S_{ki}} in k ∈ ne(i)∖j).
  ◮ Variables in the outgoing separator: X^(1)_{S_{ij}} (matching variables X^(2)_{S_{ij}} in clique j).
  ◮ (Variables that appear in more than one separator will need additional copies.)

The messages become:

M_{i→j}(X^(2)_{S_{ij}}) = ∑_{X^(−)_{C_i}, {X^(2)_{S_{ki}}}, X^(1)_{S_{ij}}} f_i(X^(−)_{C_i}, {X^(2)_{S_{ki}}}, X^(1)_{S_{ij}}) f_{ij}(X^(1)_{S_{ij}}, X^(2)_{S_{ij}}) ∏_{k∈ne(i)∖j} M_{k→i}(X^(2)_{S_{ki}})

  = ∑_{X_{C_i}∖S_{ij}} f_i(X_{C_i}) ∏_{k∈ne(i)∖j} M_{k→i}(X_{S_{ki}})

SLIDE 137

Shafer-Shenoy propagation

[Figure: cliques C_k, C_l, C_i, C_j with separators S_ki, S_li, S_ij]

Messages are computed recursively by:

M_{i→j}(X_{S_{ij}}) = ∑_{X_{C_i}∖S_{ij}} f_i(X_{C_i}) ∏_{k∈ne(i)∖j} M_{k→i}(X_{S_{ki}})

And the marginal distributions on cliques and separators are:

P(X_{C_i}) = f_i(X_{C_i}) ∏_{k∈ne(i)} M_{k→i}(X_{S_{ki}})
P(X_{S_{ij}}) = M_{i→j}(X_{S_{ij}}) M_{j→i}(X_{S_{ij}})

This is called Shafer-Shenoy propagation.

SLIDE 138

Consistency

[Figure: cliques and separators of the junction tree]

The running intersection property and the tree structure of the junction tree imply that local consistency between clique and separator marginals guarantees global consistency. If q_i(X_{C_i}) and r_{ij}(X_{S_{ij}}) are distributions such that

∑_{X_{C_i}∖S_{ij}} q_i(X_{C_i}) = r_{ij}(X_{S_{ij}})

then

P(X) = ∏_{cliques i} q_i(X_{C_i}) / ∏_{separators (ij)} r_{ij}(X_{S_{ij}})

is also a distribution (non-negative and sums to one) such that:

q_i(X_{C_i}) = ∑_{X∖X_{C_i}} P(X)        r_{ij}(X_{S_{ij}}) = ∑_{X∖X_{S_{ij}}} P(X)

SLIDE 139

Reparameterisation for message passing

[Figure: cliques and separators of the junction tree]

Hugin propagation is a different (but equivalent) message-passing algorithm, based on the idea of reparameterisation.

Initialise:

q_i(X_{C_i}) ∝ f_i(X_{C_i})        r_{ij}(X_{S_{ij}}) ∝ 1

Then our probability distribution is initially

P(X) ∝ ∏_{cliques i} q_i(X_{C_i}) / ∏_{separators (ij)} r_{ij}(X_{S_{ij}})

A Hugin propagation update for i → j is:

r^new_{ij}(X_{S_{ij}}) = ∑_{X_{C_i}∖S_{ij}} q_i(X_{C_i})
q^new_{j}(X_{C_j}) = q_j(X_{C_j}) · r^new_{ij}(X_{S_{ij}}) / r_{ij}(X_{S_{ij}})

SLIDE 140

Hugin propagation

[Figure: cliques and separators of the junction tree]

Some properties of Hugin propagation:

◮ The defined distribution P(X) is unchanged by the updates.
◮ Each update introduces a local consistency constraint:
  ∑_{X_{C_j}∖S_{ij}} q_j(X_{C_j}) = r_{ij}(X_{S_{ij}})
◮ If each update i → j is carried out only after the incoming updates k → i have been carried out, then each update need only be carried out once.
◮ Each Hugin update is equivalent to the corresponding Shafer-Shenoy update (see the sketch below).
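A numerical sketch of one such update, on hypothetical clique tables for C_i = {B, C, E} and C_j = {C, D, E} with separator S_ij = {C, E}:

```python
# One Hugin update i -> j: marginalise onto the separator, then rescale clique j.
import numpy as np

rng = np.random.default_rng(5)
K = 2
q_i = rng.random((K, K, K))    # q_i(B, C, E), axes (b, c, e)
q_j = rng.random((K, K, K))    # q_j(C, D, E), axes (c, d, e)
r_ij = np.ones((K, K))         # r_ij(C, E), initialised to 1

r_new = q_i.sum(axis=0)                      # r_ij^new(C,E) = sum_B q_i(B,C,E)
q_j = q_j * (r_new / r_ij)[:, None, :]       # q_j^new = q_j * r^new / r
r_ij = r_new

# The local consistency constraint now holds on the i side:
assert np.allclose(q_i.sum(axis=0), r_ij)
print(q_j.shape)
```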

SLIDE 141

Computational Costs of the Junction Tree Algorithm

◮ Most of the computational cost of the junction tree algorithm is incurred during the message-passing phase.
◮ The running and memory costs of the message-passing phase are both O(∑_i |X_{C_i}|). This can be significantly (exponentially) more efficient than brute force.
◮ The variable elimination ordering heuristic can have a very significant impact on the message-passing costs.
◮ For certain classes of graphical models (e.g. a 2D lattice Markov random field) it is possible to hand-craft an efficient ordering.

SLIDE 142

Other Inference Algorithms

There are other approaches to inference in graphical models which may be more efficient under specific conditions:

Cutset conditioning: or “reasoning by assumptions”. Find a small set of variables which, if they were given (i.e. known), would render the remaining graph “simpler”. For each value of these variables run some inference algorithm on the simpler graph, and average the resulting beliefs with the appropriate weights.

Loopy belief propagation: just use belief propagation even though there are loops. No guarantee of convergence, but it often works well in practice. There are some (weak) guarantees about the nature of the answer if the message passing does converge.

Second half of the course: we will learn about a variety of approximate inference algorithms for when the graphical model is so large or complex that no exact inference algorithm can work efficiently.

SLIDE 143

Learning in Graphical Models

We have discussed inference at length; what about learning? The factored structure implied by the graph also makes learning easy. Consider data points comprising observations of a subset of variables. ML learning ⇒ adjust parameters to maximise:

L = P(X_obs | θ) = ∑_{X_unobs} P(X_obs, X_unobs | θ)

By EM, we need to maximise

F(q, θ) = ⟨ log P(X_obs, X_unobs | θ) − log q(X_unobs) ⟩_{q(X_unobs)}
        = ⟨ ∑_i log f_i(X_{C_i} | θ_i) − log Z(θ) ⟩_{q(X_unobs)} + H(q)
        = ∑_i ⟨ log f_i(X_{C_i} | θ_i) ⟩_{q(X_{C_i}∖X_obs)} − log Z(θ) + H(q)

[Figure: graphical model over A–E]

So learning only requires posterior marginals on cliques (obtained by message passing) and updates on cliques; c.f. the Baum-Welch procedure for HMMs.
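As a small illustration of the resulting M step for table parameters, the sketch below renormalises hypothetical expected counts (as would be accumulated from the posterior clique marginals q over the data) into a conditional probability table; all shapes and values are made up:

```python
# ML update of a table CPT P(X_i | X_pa(i)) from expected counts.
import numpy as np

rng = np.random.default_rng(6)
# E[#(pa1, pa2, child)] under the posterior q, summed over data points:
expected_counts = rng.random((2, 2, 3)) * 10

# Normalise over the child variable (the last axis):
cpt = expected_counts / expected_counts.sum(axis=-1, keepdims=True)
assert np.allclose(cpt.sum(-1), 1.0)
print(cpt)
```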