Graphical Models: Loopy BP and Bethe Free Energy


SLIDE 1

Graphical Models

Loopy BP and Bethe Free Energy

Siamak Ravanbakhsh Winter 2018

SLIDE 2

Learning objective

loopy belief propagation
its variational derivation: Bethe approximation

SLIDE 4

So far...

exact inference: variable elimination, equivalent to belief propagation (BP) in a clique tree

This class...

what if exact inference is too expensive (i.e., the tree-width is large)?
continue to use BP: loopy BP
why is this a good idea? answer using the variational interpretation

SLIDE 6

Recap: BP in clique trees

sum-product BP message update, where $C_i$ is a cluster/clique and $S_{i,j}$ is a sepset (from the leaves towards the root, then back to the leaves):

$$\delta_{i \to j}(S_{i,j}) = \sum_{C_i - S_{i,j}} \psi_i(C_i) \prod_{k \in Nb_i - j} \delta_{k \to i}(S_{i,k})$$

marginal (belief) for each cluster:

$$p_i(C_i) \propto \beta_i(C_i) = \psi_i(C_i) \prod_{k \in Nb_i} \delta_{k \to i}(S_{i,k})$$

SLIDE 8

Clique-tree for tree structures

pairwise potentials $\phi_{i,j}(x_i, x_j)$
tree width = 1
one cluster per factor

[Figure: a tree over $x_1, \dots, x_6$, one possible clique-tree, and a different valid clique-tree]

what are the sepsets?
check for the running intersection property

SLIDE 10

BP for tree structures

pairwise potentials $\phi_{i,j}(x_i, x_j)$, one cluster per factor

message update (from the leaves towards a root, then back to the leaves):

$$\delta_{i \to j}(x_j) = \sum_{x_i} \phi_{i,j}(x_i, x_j) \prod_{k \in Nb_i - j} \delta_{k \to i}(x_i)$$

marginals (beliefs):

$$p_i(x_i) \propto \prod_{k \in Nb_i} \delta_{k \to i}(x_i)$$

$$p_{i,j}(x_i, x_j) \propto \phi_{i,j}(x_i, x_j) \prod_{k \in Nb_i - j} \delta_{k \to i}(x_i) \prod_{k \in Nb_j - i} \delta_{k \to j}(x_j)$$
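The tree message update and node beliefs can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names, the 6-node tree, and the potential values are all made up for the example, and messages are computed by memoized recursion rather than an explicit leaves-to-root schedule.

```python
import itertools
import math

def bp_tree_marginals(n, edges, phi, d=2):
    """Sum-product BP on a tree; phi[(i, j)][xi][xj] is the pairwise potential."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def pot(i, j, xi, xj):
        # Look up phi_{i,j}(x_i, x_j) regardless of how the edge is stored.
        return phi[(i, j)][xi][xj] if (i, j) in phi else phi[(j, i)][xj][xi]

    msgs = {}

    def message(i, j):
        # delta_{i->j}(x_j) = sum_{x_i} phi(x_i, x_j) prod_{k in Nb_i - j} delta_{k->i}(x_i)
        if (i, j) not in msgs:
            incoming = [message(k, i) for k in nbrs[i] if k != j]
            msgs[(i, j)] = [
                sum(pot(i, j, xi, xj) * math.prod(m[xi] for m in incoming)
                    for xi in range(d))
                for xj in range(d)
            ]
        return msgs[(i, j)]

    marginals = []
    for i in range(n):
        # p_i(x_i) is proportional to the product of all incoming messages.
        belief = [math.prod(message(k, i)[xi] for k in nbrs[i]) for xi in range(d)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals

def brute_force_marginals(n, edges, phi, d=2):
    """Exact node marginals by enumerating all joint assignments."""
    marg = [[0.0] * d for _ in range(n)]
    for x in itertools.product(range(d), repeat=n):
        w = math.prod(phi[(i, j)][x[i]][x[j]] for i, j in edges)
        for i in range(n):
            marg[i][x[i]] += w
    return [[v / sum(row) for v in row] for row in marg]

# Hypothetical 6-node tree with arbitrary positive potentials.
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
phi = {(i, j): [[1.0 + 0.5 * i, 2.0], [0.5, 1.0 + 0.3 * j]] for i, j in edges}
bp = bp_tree_marginals(6, edges, phi)
exact = brute_force_marginals(6, edges, phi)
max_gap = max(abs(a - b) for rb, re in zip(bp, exact) for a, b in zip(rb, re))
```

On a tree the two sets of marginals should agree up to floating-point error, which is what `max_gap` measures.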

SLIDE 11

BP for tree structures: reparametrization

graphical model represents

$$p(x) = \frac{1}{Z} \prod_{i,j \in E} \phi_{i,j}(x_i, x_j)$$

write it in terms of marginals:

$$p(x) = \frac{\prod_{i,j \in E} p_{i,j}(x_i, x_j)}{\prod_i p_i(x_i)^{|Nb_i| - 1}} \qquad (*)$$

why is this correct?
the denominator adjusts for double-counts
substitute the marginals using BP messages to get (*)

SLIDE 12

Variational interpretation

BP as I-projection:

$$\arg\min_q D(q \| p), \qquad p(x) = \frac{1}{Z} \prod_{i,j \in E} \phi_{i,j}(x_i, x_j)$$

write q in terms of marginals of interest:

$$q(x) = \frac{\prod_{i,j \in E} q_{i,j}(x_i, x_j)}{\prod_i q_i(x_i)^{|Nb_i| - 1}}$$

minimization gives us the marginals $q_{i,j}, q_i$

SLIDE 13

Variational free energy

$$D(q \| p) = \sum_x q(x) \big( \ln q(x) - \ln p(x) \big) = -H(q) - E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big] + \ln Z$$

using $\sum_x q(x) \ln q(x) = -H(q)$ and $E_q[\ln p(x)] = E_q\big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \big] - \ln Z$

ignore $\ln Z$: it does not depend on q, so I-projection is equivalent to

$$\arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big] \qquad \text{(variational free energy)}$$

free energy is a lower-bound on $\ln Z$ (because $D(q \| p) \ge 0$)
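The lower-bound property can be checked numerically on a toy model. The sketch below assumes a hypothetical 3-variable binary chain with made-up potentials, and evaluates $H(q) + E_q[\sum \ln \phi]$ for two choices of q: the true distribution p (where the bound should be tight) and the uniform distribution.

```python
import itertools
import math

# Hypothetical chain 0 - 1 - 2 with arbitrary positive pairwise potentials.
edges = [(0, 1), (1, 2)]
phi = {(0, 1): [[2.0, 1.0], [1.0, 3.0]], (1, 2): [[1.0, 4.0], [2.0, 1.0]]}
states = list(itertools.product(range(2), repeat=3))

def log_weight(x):
    # sum_{i,j} ln phi_{i,j}(x_i, x_j), i.e. ln of the unnormalized probability
    return sum(math.log(phi[e][x[e[0]]][x[e[1]]]) for e in edges)

log_z = math.log(sum(math.exp(log_weight(x)) for x in states))

def free_energy(q):
    """H(q) + E_q[sum ln phi] for a joint q given as {state: probability}."""
    entropy = -sum(p * math.log(p) for p in q.values() if p > 0)
    energy = sum(p * log_weight(x) for x, p in q.items())
    return entropy + energy

p = {x: math.exp(log_weight(x) - log_z) for x in states}
uniform = {x: 1.0 / len(states) for x in states}

gap_at_p = log_z - free_energy(p)              # equals D(p || p) = 0
gap_at_uniform = log_z - free_energy(uniform)  # equals D(uniform || p) > 0
```

In both cases the gap is exactly the KL divergence $D(q \| p)$, which is why maximizing the free energy and minimizing the divergence select the same q.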

SLIDE 14

Simplifying the free energy

$$\arg\min_q D(q \| p) \;\equiv\; \arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big], \qquad p(x) = \frac{1}{Z} \prod_{i,j \in E} \phi_{i,j}(x_i, x_j)$$

so far we did not use the decomposed form of q:

$$q(x) = \frac{\prod_{i,j \in E} q_{i,j}(x_i, x_j)}{\prod_i q_i(x_i)^{|Nb_i| - 1}}$$

both entropy and energy involve summation over exponentially many terms

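SLIDE 15

Simplifying the free energy

$$\arg\min_q D(q \| p) \;\equiv\; \arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big], \qquad p(x) = \frac{1}{Z} \prod_{i,j \in E} \phi_{i,j}(x_i, x_j)$$

with the decomposition

$$q(x) = \frac{\prod_{i,j \in E} q_{i,j}(x_i, x_j)}{\prod_i q_i(x_i)^{|Nb_i| - 1}}$$

the entropy simplifies (follows from the decomposition of q):

$$H(q) = \sum_{i,j \in E} H(q_{i,j}) - \sum_i (|Nb_i| - 1) H(q_i)$$

and so does the energy:

$$E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big] = \sum_{i,j \in E} \sum_{x_{i,j}} q_{i,j}(x_i, x_j) \ln \phi_{i,j}(x_i, x_j)$$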
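As a sanity check of the entropy decomposition, the sketch below compares the joint entropy of a tree-structured distribution with the sum of pairwise entropies minus the degree-weighted node entropies. The star-shaped tree, the potentials, and all helper names are assumptions made for illustration.

```python
import itertools
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginal of the joint over the variables listed in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def joint_from_potentials(n, edges, phi):
    w = {x: math.prod(phi[e][x[e[0]]][x[e[1]]] for e in edges)
         for x in itertools.product(range(2), repeat=n)}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def decomposed_entropy(joint, n, edges):
    # sum_{i,j in E} H(q_ij) - sum_i (|Nb_i| - 1) H(q_i)
    deg = [sum(i in e for e in edges) for i in range(n)]
    pair = sum(entropy(marginal(joint, e)) for e in edges)
    single = sum((deg[i] - 1) * entropy(marginal(joint, (i,))) for i in range(n))
    return pair - single

# Hypothetical star-shaped tree: node 1 is connected to nodes 0, 2 and 3.
edges = [(0, 1), (1, 2), (1, 3)]
phi = {(0, 1): [[2.0, 1.0], [1.0, 3.0]],
       (1, 2): [[1.0, 4.0], [2.0, 1.0]],
       (1, 3): [[3.0, 1.0], [1.0, 1.0]]}
joint = joint_from_potentials(4, edges, phi)
exact_h = entropy(joint)
tree_h = decomposed_entropy(joint, 4, edges)
```

The point of the decomposition is cost: `exact_h` sums over all $2^n$ joint states, while `tree_h` only touches pairwise and single-variable tables.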

SLIDE 16

Variational interpretation: marginal constraints

$$\arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big]$$

$$H(q) = \sum_{i,j \in E} H(q_{i,j}) - \sum_i (|Nb_i| - 1) H(q_i)$$

$$E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big] = \sum_{i,j \in E} \sum_{x_{i,j}} q_{i,j}(x_i, x_j) \ln \phi_{i,j}(x_i, x_j)$$

the marginals $q_{i,j}, q_i$ should be "valid": a real distribution with these marginals should exist (marginal polytope)
for tree graphical models this local consistency is enough:

$$\sum_{x_i} q_{i,j}(x_i, x_j) = q_j(x_j) \quad \forall i,j \in E,\; x_j$$

SLIDE 19

Variational derivation of BP

$$\arg\max_{\{q\}} \; \sum_{i,j \in E} H(q_{i,j}) - \sum_i (|Nb_i| - 1) H(q_i) + \sum_{i,j \in E} \sum_{x_{i,j}} q_{i,j}(x_i, x_j) \ln \phi_{i,j}(x_i, x_j)$$

subject to locally consistent marginal distributions:

$$\sum_{x_i} q_{i,j}(x_i, x_j) = q_j(x_j) \quad \forall i,j \in E,\; x_j$$

$$q_{i,j}(x_i, x_j) \ge 0 \quad \forall i,j \in E,\; x_i, x_j$$

$$\sum_{x_i} q_i(x_i) = 1 \quad \forall i$$

BP update is derived as "fixed-points" of the Lagrangian
BP messages are the (exponential form of the) Lagrange multipliers
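A hedged sketch of how the message update falls out of the Lagrangian is given below. The multiplier names $\lambda_{ij \to i}$ and $\lambda_{ij \to j}$ are introduced here for illustration and do not appear in the slides; normalization multipliers are absorbed into $\propto$, positivity is assumed to hold at the stationary point, and $|Nb_i| > 1$ is assumed wherever the node term is divided out.

```latex
\begin{align*}
% objective plus one multiplier per (edge, endpoint, state) for local consistency
\mathcal{L} ={}& \sum_{i,j \in E} H(q_{i,j}) - \sum_i (|Nb_i| - 1) H(q_i)
   + \sum_{i,j \in E} \sum_{x_{i,j}} q_{i,j}(x_i, x_j) \ln \phi_{i,j}(x_i, x_j) \\
 & + \sum_{i,j \in E} \Big[ \sum_{x_i} \lambda_{ij \to i}(x_i) \Big( \sum_{x_j} q_{i,j}(x_i, x_j) - q_i(x_i) \Big)
   + \sum_{x_j} \lambda_{ij \to j}(x_j) \Big( \sum_{x_i} q_{i,j}(x_i, x_j) - q_j(x_j) \Big) \Big] \\[4pt]
% stationarity with respect to q_{i,j}(x_i, x_j)
\frac{\partial \mathcal{L}}{\partial q_{i,j}} = 0
 \;\Rightarrow\;& q_{i,j}(x_i, x_j) \propto \phi_{i,j}(x_i, x_j)\, e^{\lambda_{ij \to i}(x_i)}\, e^{\lambda_{ij \to j}(x_j)} \\
% stationarity with respect to q_i(x_i)
\frac{\partial \mathcal{L}}{\partial q_i} = 0
 \;\Rightarrow\;& q_i(x_i) \propto \exp\Big( \frac{1}{|Nb_i| - 1} \sum_{j \in Nb_i} \lambda_{ij \to i}(x_i) \Big) \\[4pt]
% change of variables: multipliers in exponential form are products of messages
\text{set } e^{\lambda_{ij \to i}(x_i)} ={}& \prod_{k \in Nb_i - j} \delta_{k \to i}(x_i)
 \;\Rightarrow\; q_i(x_i) \propto \prod_{k \in Nb_i} \delta_{k \to i}(x_i) \\
 & q_{i,j}(x_i, x_j) \propto \phi_{i,j}(x_i, x_j) \prod_{k \in Nb_i - j} \delta_{k \to i}(x_i) \prod_{k \in Nb_j - i} \delta_{k \to j}(x_j) \\[4pt]
% enforcing sum_{x_i} q_{i,j} = q_j and cancelling the common factor
\sum_{x_i} q_{i,j}(x_i, x_j) = q_j(x_j)
 \;\Rightarrow\;& \delta_{i \to j}(x_j) \propto \sum_{x_i} \phi_{i,j}(x_i, x_j) \prod_{k \in Nb_i - j} \delta_{k \to i}(x_i)
\end{align*}
```

The node belief simplifies because each $\delta_{k \to i}$ appears in $|Nb_i| - 1$ of the terms $\lambda_{ij \to i}$, which cancels the $1/(|Nb_i| - 1)$ factor.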

SLIDE 23

What happens if there are loops?

We can still apply the BP update:

$$\delta_{i \to j}(x_j) \propto \sum_{x_i} \psi_{i,j}(x_i, x_j) \prod_{k \in Nb_i - j} \delta_{k \to i}(x_i)$$

($\propto$: normalize the message for numerical stability)

update the messages synchronously or sequentially
may not converge (oscillating behavior)
even when convergent, it only gives an approximation:

$$\hat{p}(x_i) \propto \prod_{k \in Nb_i} \delta_{k \to i}(x_i)$$

is not (proportional to) the exact marginal $p(x_i)$
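A minimal sketch of loopy BP with synchronous, normalized updates follows. The 4-cycle, its potentials, the stopping tolerance, and the function names are assumptions for illustration; the brute-force routine is only there so the approximate beliefs can be compared with exact marginals on small graphs.

```python
import itertools
import math

def loopy_bp(n, edges, psi, d=2, max_iters=500, tol=1e-12):
    """Synchronous sum-product updates; returns (beliefs, converged)."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def pot(i, j, xi, xj):
        return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]

    msgs = {(i, j): [1.0 / d] * d for i in nbrs for j in nbrs[i]}
    converged = False
    for _ in range(max_iters):
        new = {}
        for i, j in msgs:
            raw = [sum(pot(i, j, xi, xj)
                       * math.prod(msgs[(k, i)][xi] for k in nbrs[i] if k != j)
                       for xi in range(d))
                   for xj in range(d)]
            total = sum(raw)
            new[(i, j)] = [v / total for v in raw]  # normalize for stability
        change = max(abs(a - b) for key in msgs for a, b in zip(new[key], msgs[key]))
        msgs = new
        if change < tol:
            converged = True
            break
    beliefs = []
    for i in range(n):
        b = [math.prod(msgs[(k, i)][xi] for k in nbrs[i]) for xi in range(d)]
        beliefs.append([v / sum(b) for v in b])
    return beliefs, converged

def exact_marginals(n, edges, psi, d=2):
    marg = [[0.0] * d for _ in range(n)]
    for x in itertools.product(range(d), repeat=n):
        w = math.prod(psi[(i, j)][x[i]][x[j]] for i, j in edges)
        for i in range(n):
            marg[i][x[i]] += w
    return [[v / sum(row) for v in row] for row in marg]

# Hypothetical 4-cycle with arbitrary positive potentials.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
psi = {(0, 1): [[3.0, 1.0], [1.0, 2.0]], (1, 2): [[1.0, 2.0], [2.0, 1.0]],
       (2, 3): [[2.0, 1.0], [1.0, 3.0]], (3, 0): [[1.0, 1.0], [2.0, 1.0]]}
approx, converged = loopy_bp(4, cycle, psi)
exact = exact_marginals(4, cycle, psi)
```

Comparing `approx` with `exact` shows the size of the approximation error on this particular cycle; running the same function on a graph without loops reproduces the exact marginals.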

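SLIDE 27

Loopy BP on factor graphs

$$p(x) = \frac{1}{Z} \prod_I \psi_I(x_I), \qquad I \subseteq \{1, \dots, N\} \text{ is a subset of variables}$$

[Figure: factor graph with variable nodes $x_1, \dots, x_5$ and factor nodes $\psi_{\{1,2,3\}}$, $\psi_{\{3,5\}}$]

variable-to-factor message:

$$\delta_{i \to I}(x_i) \propto \prod_{J \mid i \in J,\, J \ne I} \delta_{J \to i}(x_i)$$

factor-to-variable message:

$$\delta_{I \to i}(x_i) \propto \sum_{x_{I - i}} \psi_I(x_I) \prod_{j \in I - i} \delta_{j \to I}(x_j)$$

after convergence:

$$\hat{p}(x_i) \propto \prod_{J \mid i \in J} \delta_{J \to i}(x_i)$$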
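The two message types can be sketched as follows. The scopes mirror the figure's factors $\psi_{\{1,2,3\}}$ and $\psi_{\{3,5\}}$ (zero-indexed below), but the table values, the fixed number of sweeps, and the sequential update order are assumptions chosen for the example.

```python
import itertools
import math

def normalize(v):
    s = sum(v)
    return [a / s for a in v]

def factor_graph_bp(n, factors, d=2, sweeps=20):
    """factors: list of (scope, table); table maps scope assignments to positive values."""
    v2f = {(i, a): [1.0 / d] * d for a, (scope, _) in enumerate(factors) for i in scope}
    f2v = {(a, i): [1.0 / d] * d for a, (scope, _) in enumerate(factors) for i in scope}
    for _ in range(sweeps):
        # variable-to-factor: product of messages from all other factors containing i
        for (i, a) in v2f:
            v2f[(i, a)] = normalize([
                math.prod(f2v[(b, i)][xi]
                          for b, (scope, _) in enumerate(factors)
                          if i in scope and b != a)
                for xi in range(d)])
        # factor-to-variable: sum out every variable of the scope except i
        for (a, i) in f2v:
            scope, table = factors[a]
            out = [0.0] * d
            for xs in itertools.product(range(d), repeat=len(scope)):
                w = table[xs] * math.prod(v2f[(j, a)][xj]
                                          for j, xj in zip(scope, xs) if j != i)
                out[xs[scope.index(i)]] += w
            f2v[(a, i)] = normalize(out)
    return [normalize([math.prod(f2v[(a, i)][xi]
                                 for a, (scope, _) in enumerate(factors) if i in scope)
                       for xi in range(d)])
            for i in range(n)]

def brute_force(n, factors, d=2):
    marg = [[0.0] * d for _ in range(n)]
    for x in itertools.product(range(d), repeat=n):
        w = math.prod(table[tuple(x[i] for i in scope)] for scope, table in factors)
        for i in range(n):
            marg[i][x[i]] += w
    return [normalize(row) for row in marg]

# Made-up tables on scopes {0, 1, 2} and {2, 4}; variable 3 appears in no factor.
psi_a = {xs: 1.0 + xs[0] + 2.0 * xs[1] * xs[2]
         for xs in itertools.product(range(2), repeat=3)}
psi_b = {(u, v): 2.0 if u == v else 1.0 for u in range(2) for v in range(2)}
factors = [((0, 1, 2), psi_a), ((2, 4), psi_b)]
bp = factor_graph_bp(5, factors)
exact = brute_force(5, factors)
gap = max(abs(p - q) for rb, re in zip(bp, exact) for p, q in zip(rb, re))
```

This particular factor graph has no loops, so the beliefs should match the enumerated marginals; a variable that appears in no factor ends up with a uniform belief.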

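SLIDE 30

Loopy BP on factor graphs: complexity

[Figure: factor graph with variable nodes $x_1, \dots, x_5$ and factor nodes $\psi_{\{1,2,3\}}$, $\psi_{\{3,5\}}$]

variable-to-factor messages (from each variable to all of its neighbors):

$$\delta_{i \to I}(x_i) \propto \prod_{J \mid i \in J,\, J \ne I} \delta_{J \to i}(x_i)$$

cost: $n \, d \, \Delta_{\max}^2$, where $n$ is the number of variables, $d$ the domain size (2 for binary), and $\Delta_{\max}$ the maximum number of neighbours

factor-to-variable messages:

$$\delta_{I \to i}(x_i) \propto \sum_{x_{I - i}} \psi_I(x_I) \prod_{j \in I - i} \delta_{j \to I}(x_j)$$

cost: $m \, |\mathrm{Scope}_{\max}| \, d^{|\mathrm{Scope}_{\max}|}$, where $m$ is the number of factors and $|\mathrm{Scope}_{\max}|$ the maximum number of variables in a factor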

SLIDE 31

(Loopy) BP has found many applications

Social network analysis: stochastic block modelling
Machine Learning: clustering, tensor factorization
Vision: inpainting & denoising, stereo matching
NLP and bioinformatics: Viterbi algorithm
Combinatorial optimization:

https://graph-tool.skewed.de
www.jianxiongxiao.com

SLIDE 34

Application: LDPC coding using BP

low-density parity check (LDPC)

$x_1, \dots, x_n$ are sent through a noisy channel
$y_1, \dots, y_n$ are observed

$$p(y_i = 1 \mid x_i = 1) = p(y_i = 0 \mid x_i = 0) = 1 - \epsilon$$

the message satisfies parity constraints:

[Figure: image from Wainwright & Jordan]

joint dist. over unobserved message:

$$p(x \mid y) \propto \prod_{s,t,u} \psi(x_s, x_t, x_u) \prod_{i=1}^{n} \big( (1 - \epsilon) I(x_i = y_i) + \epsilon I(x_i \ne y_i) \big)$$

SLIDE 36

Application: LDPC coding using BP

low-density parity check

joint dist. over unobserved message:

$$p(x \mid y) \propto \prod_{s,t,u} \psi(x_s, x_t, x_u) \prod_{i=1}^{n} \big( (1 - \epsilon) I(x_i = y_i) + \epsilon I(x_i \ne y_i) \big)$$

inference problems:

most likely joint assignment:

$$x^* = \arg\max_x p(x \mid y)$$

max-marginals:

$$x_i^* = \arg\max_{x_i} p(x_i \mid y)$$

calculate the marginals $p(x_i \mid y) \; \forall i$ using loopy BP

image: Wainwright & Jordan
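The two decoding problems can be illustrated on a toy code small enough to enumerate. This sketch replaces loopy BP with brute-force enumeration, so it only shows what is being computed, not how BP computes it. It also assumes that each $\psi(x_s, x_t, x_u)$ is an even-parity indicator, which the slides do not spell out; the two checks, the received word, and $\epsilon = 0.1$ are made up.

```python
import itertools

def posterior(y, checks, eps):
    """p(x | y) by enumeration: parity indicators times bit-flip channel likelihood."""
    weights = {}
    for x in itertools.product((0, 1), repeat=len(y)):
        # Assumed form of psi: 1 if x_s + x_t + x_u is even, else 0.
        if any((x[s] + x[t] + x[u]) % 2 for s, t, u in checks):
            continue
        like = 1.0
        for xi, yi in zip(x, y):
            like *= (1 - eps) if xi == yi else eps
        weights[x] = like
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

def map_decode(post):
    # Most likely joint assignment x* = argmax_x p(x | y).
    return max(post, key=post.get)

def bitwise_decode(post, n):
    # x_i* = argmax_{x_i} p(x_i | y), decided separately for each bit.
    decoded = []
    for i in range(n):
        p_one = sum(p for x, p in post.items() if x[i] == 1)
        decoded.append(1 if p_one > 0.5 else 0)
    return tuple(decoded)

checks = [(0, 1, 2), (1, 2, 3)]   # hypothetical parity checks on 4 bits
y = (1, 0, 0, 0)                  # received word with one flipped bit
post = posterior(y, checks, eps=0.1)
x_map = map_decode(post)
x_bits = bitwise_decode(post, len(y))
```

Here both decoders recover the all-zeros codeword. In a real LDPC code n is far too large to enumerate, which is where the loopy BP marginals come in.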

SLIDE 40

Loops and variational interpretation

$$\arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big]$$

the entropy term is not exact anymore:

$$H(q) \approx \sum_{i,j \in E} H(q_{i,j}) - \sum_i (|Nb_i| - 1) H(q_i)$$

called the Bethe approximation to the entropy
generally not convex anymore (multiple fixed points)

the energy term:

$$E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big] = \sum_{i,j \in E} \sum_{x_{i,j}} q_{i,j}(x_i, x_j) \ln \phi_{i,j}(x_i, x_j)$$

SLIDE 43

Loops and variational interpretation

$$\arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big]$$

the entropy term is not exact anymore
local consistency constraints are inadequate: locally consistent $q_{i,j}, q_i$ may not be marginals for any joint dist.
i.e., the local consistency polytope $\mathbb{L}$ is an outer bound on the marginal polytope

$$\mathbb{L}: \quad \sum_{x_i} q_{i,j}(x_i, x_j) = q_j(x_j) \quad \forall i,j \in E,\; x_j$$

[Figure: the two polytopes, with points labelled $[q_1, \dots, q_n, q_{1,3}, \dots, q_{m,n}]$ and $[p_1, \dots, p_n, p_{1,3}, \dots, p_{m,n}]$]
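A standard small example of locally consistent pseudo-marginals that belong to no joint distribution uses a 3-cycle in which every edge insists that its two endpoints disagree. The sketch below builds that example (the tables are chosen for illustration) and checks both halves of the argument.

```python
import itertools

edges = [(0, 1), (1, 2), (0, 2)]
q_node = {i: [0.5, 0.5] for i in range(3)}
# Every edge puts all of its mass on the two disagreeing states.
q_edge = {e: [[0.0, 0.5], [0.5, 0.0]] for e in edges}

def locally_consistent(q_node, q_edge, tol=1e-12):
    """Check that each pairwise table marginalizes to both of its node tables."""
    for (i, j), table in q_edge.items():
        for xj in range(2):
            if abs(sum(table[xi][xj] for xi in range(2)) - q_node[j][xj]) > tol:
                return False
        for xi in range(2):
            if abs(sum(table[xi][xj] for xj in range(2)) - q_node[i][xi]) > tol:
                return False
    return True

consistent = locally_consistent(q_node, q_edge)

# Under the pseudo-marginals, the expected number of disagreeing edges is 3.
expected_disagreements = sum(t[0][1] + t[1][0] for t in q_edge.values())

# But no assignment of three binary variables makes all three edges disagree.
max_disagreements = max(sum(x[i] != x[j] for i, j in edges)
                        for x in itertools.product((0, 1), repeat=3))
```

Any joint distribution is a mixture of assignments, so its expected number of disagreeing edges is at most `max_disagreements`. Since the pseudo-marginals require 3, they lie in the local consistency polytope but outside the marginal polytope.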

SLIDE 44

Variations on BP

$$\arg\max_q \; H(q) + E_q\Big[ \sum_{i,j} \ln \phi_{i,j}(x_i, x_j) \Big]$$

the entropy term is not exact anymore: improved entropy approximations (e.g., region-based, convex)
local consistency constraints are inadequate: tighter constraints (e.g., marginal consistency of larger clusters)

SLIDE 49

Variations on BP: cluster-graph

cluster-graph generalizes clique-tree:
clusters are not necessarily max-cliques
running intersection property
family-preserving property
$S_{i,j} \subseteq C_i \cap C_j$ (instead of = in clique-tree)

similar reparametrization:

$$p(x) \propto \frac{\prod_i \hat{p}(C_i)}{\prod_{i,j} \hat{p}(S_{i,j})} \qquad \text{(instead of = in clique-tree)}$$

[Figure: a factor-graph over A, B, C, D, E, F; the corresponding cluster-graph (the same BP updates); an improved cluster-graph (better entropy approximation + marginal constraint)]

SLIDE 50

BP in practice

works well for:
locally tree-like graphs
dense graphs with weak interactions

sequential update works better than parallel update
improved convergence by damping (smoothing) the update:

$$\delta^{(t+1)}_{i \to I}(x_i) \propto (1 - \alpha)\, \delta^{(t)}_{i \to I}(x_i) + \alpha \prod_{J \mid i \in J,\, J \ne I} \delta^{(t)}_{J \to i}(x_i)$$

[Figure: 11 x 11 Ising grid]
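The damped update for a single variable-to-factor message can be sketched as below. One detail is an assumption rather than something the slide specifies: the product of incoming messages is normalized before it is mixed with the old message, so that $\alpha$ weighs two distributions on the same scale. The function name and example numbers are illustrative.

```python
import math

def damped_variable_to_factor(old, incoming, alpha):
    """Mix the previous message with the plain BP update.

    old:      previous message delta^{(t)}_{i->I}, a list over the states of x_i
    incoming: messages delta^{(t)}_{J->i} from the other factors J != I
    alpha:    step size; alpha = 1 is the undamped update, alpha = 0 keeps `old`
    """
    d = len(old)
    product = [math.prod(m[x] for m in incoming) for x in range(d)]
    total = sum(product)
    product = [v / total for v in product]   # assumed: normalize before mixing
    mixed = [(1 - alpha) * o + alpha * p for o, p in zip(old, product)]
    norm = sum(mixed)
    return [v / norm for v in mixed]

old = [0.5, 0.5]
incoming = [[0.9, 0.1], [0.5, 0.5]]
half_step = damped_variable_to_factor(old, incoming, alpha=0.5)
full_step = damped_variable_to_factor(old, incoming, alpha=1.0)
no_step = damped_variable_to_factor(old, incoming, alpha=0.0)
```

With $\alpha = 0.5$ the new message lands halfway between the old message and the undamped update, which is the smoothing that helps suppress oscillations.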

SLIDE 53

Summary

belief propagation: efficient deterministic inference
exact in clique-tree = variable elimination
application of distributive law

optimization perspective: KL-divergence minimization

works well in (cluster) graphs with loops (large tree-width): approximate objective (Bethe free energy) and constraints