SLIDE 1

Approximate inference on graphical models: variational methods

Alexandre Bouchard-Côté

SLIDE 2

Exact inference in general graphs. . .

  • Recall: we now have an exact, general and “efficient” inference algorithm: the Junction-Tree algorithm

  • Why should you care about approximate inference?
SLIDE 3

Exact inference in general graphs is hard

  • Running time of JT is exponential in the max clique size of the JT
  • Don’t have to look very far to find graphs where JT is arbitrarily slow
SLIDE 4

We need approximate inference

  • A very hot topic in the Machine Learning community
  • Lots of important, open problems
  • Next lectures: Markov Chain Monte Carlo (MCMC) algorithms
  • Today: a completely different approach. . .
SLIDE 5

Variational methods for approximate inference

  • Framework:

– Cast the inference problem into a variational (optimization) problem
– Relax (simplify) the variational problem

SLIDE 6

Variational vs. sampling approaches

Sampling:
  + Converges to the true answer
  + Large toolbox and literature
  − Mixing can be slow
  − Assessing convergence

Variational:
  + Generally very fast
  + Deterministic algorithms
  − Approximation can be poor
  − Approximation can fail

SLIDE 7

Program

  • Specific examples of variational methods
  • Outline of the unifying theory
  • Examples revisited
SLIDE 8

Examples

SLIDE 9

Examples will be on the good old Ising model

  • An undirected graphical model structured as a lattice in R^d
  • Sufficient statistics φ(x_s, x_t) = x_s x_t, with x_u ∈ {−1, +1}, encourage agreement of neighbors:

P_θ(X = x) = exp( ∑_{(s,t)∈E} θ_{s,t} x_s x_t − A(θ) )

  • We will actually use x_u ∈ {0, 1} to slightly simplify the derivations
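To make the notation concrete, here is a minimal sketch (not from the slides) of this model in Python: the 3×3 grid size and the random edge parameters θ are made-up illustration values, and A(θ) is computed by brute-force enumeration, which is exactly what becomes infeasible on larger grids.

```python
# A minimal sketch (not from the slides) of this Ising model on a tiny grid,
# with x_u in {0, 1} as the slides use in the derivations. The 3x3 grid size
# and the random edge parameters theta are made-up illustration values.
import itertools
import math
import random

n = 3  # 3x3 grid; brute-force A(theta) below is only feasible for tiny grids
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = ([((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)]
         + [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)])
theta = {e: random.uniform(-1.0, 1.0) for e in edges}

def score(x):
    """sum_{(s,t) in E} theta_{s,t} x_s x_t, the exponent before subtracting A(theta)."""
    return sum(theta[s, t] * x[s] * x[t] for (s, t) in edges)

def log_partition():
    """A(theta) = log sum_x exp(score(x)), by enumerating all 2^9 configurations."""
    scores = [score(dict(zip(nodes, v)))
              for v in itertools.product([0, 1], repeat=len(nodes))]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

x = {u: random.choice([0, 1]) for u in nodes}
print("log P(X = x) =", score(x) - log_partition())
```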

SLIDE 10

Importance and physical interpretation

  • For grids in dimension > 2, encapsulates the full hardness of inference
  • Originates from statistical physics: model for a crystal structure

– vertices represent spins of particles
– edges represent bonds

  • Demonstration. . .
SLIDE 11

Example 1: Loopy Belief Propagation

  • Run max-product, even if you are not supposed to. . .

M_{t→s}(x_s) ∝ ∑_{x_t∈{0,1}} φ_{s,t}(x_s, x_t) φ_t(x_t) ∏_{u∈N(t)−{s}} M_{u→t}(x_t)

  • t sends a message to s when it has received the messages from all the other neighbors: with this protocol, it makes sense only on trees

SLIDE 12

Example 1: Loopy Belief Propagation

  • t sends a message to s when it has received the messages from all the other neighbors: with this protocol, it makes sense only on trees
  • On trees, the following protocol is equivalent: initialize the messages to one, then, at every iteration, all nodes send a message using what they received from the previous iteration

  • Makes sense on arbitrary graphs!
  • Does it work?
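As a rough illustration, here is a sketch (not from the slides) of this flooding schedule on a small loopy graph, written as sum-product with the Ising pairwise potentials and uniform node potentials; the graph, the edge parameters and the number of iterations are made-up illustration values.

```python
# A sketch (not from the slides) of loopy BP with the flooding schedule above,
# written as sum-product on a toy {0,1} Ising graph: phi_{s,t}(x_s, x_t) =
# exp(theta_{s,t} x_s x_t), uniform node potentials phi_t. The graph, the edge
# parameters and the number of iterations are made-up illustration values.
import math
import random

nodes = range(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]          # a small graph with loops
theta = {e: random.uniform(-1.0, 1.0) for e in edges}
neighbors = {u: [v for e in edges for v in e if u in e and v != u] for u in nodes}

def phi(s, t, xs, xt):
    """Pairwise potential exp(theta_{s,t} * x_s * x_t), looked up in either edge order."""
    w = theta[(s, t)] if (s, t) in theta else theta[(t, s)]
    return math.exp(w * xs * xt)

# messages[(t, s)][x_s]: message from t to s, initialized to one
messages = {(t, s): [1.0, 1.0] for t in nodes for s in neighbors[t]}

for _ in range(50):                        # flooding schedule: everyone sends each round
    new = {}
    for (t, s) in messages:
        m = [sum(phi(s, t, xs, xt)
                 * math.prod(messages[(u, t)][xt] for u in neighbors[t] if u != s)
                 for xt in (0, 1))
             for xs in (0, 1)]
        z = sum(m)                         # normalize to keep the numbers well scaled
        new[(t, s)] = [mi / z for mi in m]
    messages = new

for s in nodes:                            # beliefs: products of incoming messages
    b = [math.prod(messages[(u, s)][xs] for u in neighbors[s]) for xs in (0, 1)]
    print(f"node {s}: P(x_s = 1) is approximately {b[1] / sum(b):.3f}")
```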
SLIDE 13

Example 2: Naive mean field

  • A simpler coordinate ascent algorithm

µ_u ← 1 / ( 1 + exp( −2 ∑_{s∈N(u)−{u}} θ_{s,u} µ_s ) )

  • Our goal is to make sense out of these algorithms
SLIDE 14

Unifying theory

SLIDE 15

The plan

  • Focus on computing A(θ) and µ(θ) = Eθφ(X)
  • Construct an optimization problem s.t.

– A(θ) is its maximum value
– µ(θ) is the maximizing argument

  • Relax/simplify this optimization problem
SLIDE 16

How to construct a variational formulation for A?

  • Key concept: convex duality (recall A is convex. . . )
  • Two equivalent ways to specify convex functions
SLIDE 17

Convex Duality

  • The convex conjugate of f : R^d → R ∪ {+∞}, denoted f*, makes this equivalence explicit:

f*(y) := sup_{x∈R^d} ( ⟨y, x⟩ − f(x) )

  • Set f*(y) = +∞ for unbounded values, so that f* : R^d → R ∪ {+∞}.
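A quick way to get a feel for this definition is to approximate the sup numerically. The sketch below (not from the slides) does a grid search for the made-up test function f(x) = x²/2, whose conjugate is known in closed form to be f*(y) = y²/2.

```python
# A small numerical illustration (not from the slides) of the conjugate
# f*(y) = sup_x (<y, x> - f(x)), approximating the sup by a grid search over x.
# The test function f(x) = x^2 / 2 is a made-up choice; its conjugate is known
# in closed form to be f*(y) = y^2 / 2, which the grid search should recover.
import numpy as np

def conjugate(f, y, xs=np.linspace(-50.0, 50.0, 200001)):
    """Approximate f*(y) = sup_x (x * y - f(x)) over a fixed grid of x values."""
    return float(np.max(xs * y - f(xs)))

f = lambda x: 0.5 * x ** 2
for y in [-2.0, 0.0, 1.5, 3.0]:
    print(y, conjugate(f, y), 0.5 * y ** 2)   # the last two columns should agree closely
```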
SLIDE 18

Geometric picture

  • Warning: for pedagogical reasons, assume for now that f is univariate, twice differentiable and strictly convex (can be made more general!!)

  • “f acts on points, f ∗ acts on tangents”
SLIDE 19

Connection with our problem

  • We will show that for convex f:

f** := (f*)* = f

  • Using this with f = A and expanding the definition of convex conjugacy:

A(θ) = A**(θ) = sup_x ( ⟨θ, x⟩ − A*(x) ),

a variational formulation.

SLIDE 20

f** = f (1)

  • First, let us fix some y_0 ∈ R and find a “closed form” for f*(y_0):

f*(y_0) := sup_{x∈R^d} ( x y_0 − f(x) )

  • Use differentiability and convexity to apply the derivative test:

d/dx ( x y_0 − f(x) ) = y_0 − f′(x) = 0  ⇒  f′(x_max) = y_0

  • Observation: f strictly convex ⇒ f′ strictly increasing ⇒ the function x ↦ f′(x) is invertible

SLIDE 21

f** = f (1)

  • First, let us fix some y_0 ∈ R and find a “closed form” for f*(y_0):

f*(y_0) := sup_{x∈R^d} ( x y_0 − f(x) )

  • Use differentiability and convexity to apply the derivative test:

d/dx ( x y_0 − f(x) ) = 0  ⇒  f′(x_max) = y_0  ⇒  x_max = f′⁻¹(y_0)

  • Plug this into x y_0 − f(x) to get f*:

f*(y) = y x_max − f(x_max) = y f′⁻¹(y) − f(f′⁻¹(y))
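As a concrete worked example (not from the slides), take f(x) = eˣ, which is strictly convex and twice differentiable:

```latex
% A worked example (not from the slides) under the same assumptions,
% for the made-up choice f(x) = e^x:
f'(x) = e^x, \qquad
x_{\max} = f'^{-1}(y) = \log y \quad \text{for } y \in \Im(f') = (0, \infty), \qquad
f^*(y) = y \, f'^{-1}(y) - f\bigl(f'^{-1}(y)\bigr) = y \log y - y.
```

For y outside ℑ(f′) the sup is not attained (it is unbounded for y < 0), which is exactly the caveat addressed a couple of slides ahead.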
SLIDE 22

f** = f (2)

  • We can repeat the same process on

f**(y_0) := sup_{x∈R^d} ( x y_0 − f*(x) ),

using the expression f*(x) = x f′⁻¹(x) − f(f′⁻¹(x)) we found on the previous slide.

  • f*′ looks bad at first but things cancel out:

f*′(x) = d/dx [ x f′⁻¹(x) − f(f′⁻¹(x)) ]
       = f′⁻¹(x) + x (f′⁻¹)′(x) − f′(f′⁻¹(x)) · (f′⁻¹)′(x)
       = f′⁻¹(x)      (since f′(f′⁻¹(x)) = x, the last two terms cancel)

SLIDE 23

f** = f (3)

  • This actually yields an alternate, implicit characterization of the convex conjugate (in the context of our restricted assumptions):

f*′(x) = f′⁻¹(x)

  • We can already see from this equation, applied twice, that f** = f up to a constant
  • Applying the derivative test as in part (1) of the derivation and using this result, we get f** = f (check).

SLIDE 24

Caveat

  • There was a problem with this derivation:
  • f′(f′⁻¹(x)) = x is only defined for x in the image of f′, ℑ(f′)
  • Set f*(x) = +∞ otherwise
  • In the inference setting (f = A), this will correspond to constraints on the realizable mean parameters

SLIDE 25

Example: Bernoulli random variable

  • P(X = x) ∝ exp(θx) for x ∈ {0, 1}, θ ∈ R
  • A(θ) = log(1 + exp(θ))
  • Let’s compute A* using the formula we derived:

A*(µ) = µ A′⁻¹(µ) − A(A′⁻¹(µ))   for µ ∈ ℑ(A′),   +∞ otherwise

  • By the way, recall: E_θ φ(X) = A′(θ); here φ(x) = x, that explains the notation
SLIDE 26

Example: Bernoulli random variable

  • A(θ) = log(1 + exp(θ))
  • A′(θ) = exp(θ) / (1 + exp(θ))
  • A′⁻¹(µ) = log( µ / (1 − µ) ),   for µ ∈ ℑ(A′) = (0, 1)

SLIDE 27

Plug these into A*(µ) = µ A′⁻¹(µ) − A(A′⁻¹(µ))

  • A(θ) = log(1 + exp(θ))
  • A′⁻¹(µ) = log( µ / (1 − µ) ),   for µ ∈ ℑ(A′) = (0, 1)
  • Get, for µ ∈ ℑ(A′) = (0, 1):

A*(µ) = µ A′⁻¹(µ) − A(A′⁻¹(µ))
      = µ log( µ / (1 − µ) ) − log( 1 + exp( log( µ / (1 − µ) ) ) )
      = µ log µ + (1 − µ) log(1 − µ)
  • Does that look familiar?
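As a quick numerical sanity check (not in the slides), the sketch below approximates A*(µ) = sup_θ (θµ − A(θ)) by a grid search over θ and compares it to the closed form just derived.

```python
# A quick numerical sanity check (not from the slides): for the Bernoulli family,
# A*(mu) = sup_theta (theta * mu - A(theta)) should match mu log mu + (1-mu) log(1-mu).
import numpy as np

A = lambda theta: np.log1p(np.exp(theta))        # A(theta) = log(1 + exp(theta))
thetas = np.linspace(-30.0, 30.0, 600001)        # grid standing in for the sup over theta

for mu in [0.1, 0.5, 0.9]:
    numeric = float(np.max(thetas * mu - A(thetas)))
    closed_form = mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
    print(mu, numeric, closed_form)              # the two values should agree closely
```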
SLIDE 28

General expression for A*

  • This is the negative entropy!
  • This actually holds in general: let H_µ := −E_µ log p_µ(X) be the entropy of the r.v. characterized by the moment parameters µ; then

A*(µ) = −H_µ   if µ ∈ M,   +∞ otherwise

  • Here, M := ℑ(A′) is the set of realizable mean parameters
SLIDE 29

Negative entropy interpretation

  • General derivation: for µ ∈ M:

A*(µ) = ⟨θ(µ), µ⟩ − A(θ(µ))
      = ⟨θ(µ), E_{θ(µ)} φ(X)⟩ − log Z(θ(µ))
      = E_{θ(µ)} log exp⟨θ(µ), φ(X)⟩ − E_{θ(µ)} log Z(θ(µ))
      = E_{θ(µ)} log [ exp⟨θ(µ), φ(X)⟩ / Z(θ(µ)) ]
      = E_{θ(µ)} log p_µ(X)
      = −H_µ

SLIDE 30

Finally, a variational formulation

  • Given θ_0, the following optimization problem:

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

– has optimal value A(θ_0)

  • Moreover, it is maximized by µ = E_{θ_0} φ(X):

– the value µ_max achieving the sup is s.t. θ_0 = A*′(µ_max)
– using A*′(x) = A′⁻¹(x), get µ_max = A′(θ_0)
– hence µ_max = E_{θ_0} φ(X)
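For intuition, here is a small numerical check (not from the slides) of this formulation in the Bernoulli case, with a made-up θ_0: the grid search should recover both A(θ_0) and the maximizing mean parameter.

```python
# A numerical check (not from the slides) of the variational formulation in the
# Bernoulli case: sup_mu (theta0 * mu + H(mu)) should equal A(theta0) = log(1 + e^theta0)
# and should be attained at mu = sigmoid(theta0). theta0 = 1.3 is a made-up value.
import numpy as np

H = lambda mu: -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))   # entropy for mu in (0, 1)
mus = np.linspace(1e-6, 1 - 1e-6, 2000001)                      # grid over M = (0, 1)

theta0 = 1.3
objective = theta0 * mus + H(mus)
print("sup value:", float(objective.max()), " A(theta0):", float(np.log1p(np.exp(theta0))))
print("maximizer:", float(mus[objective.argmax()]), " sigmoid(theta0):", 1 / (1 + np.exp(-theta0)))
```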

SLIDE 31

Finally, a variational formulation

  • Given θ_0, the following optimization problem:

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

– has optimal value A(θ_0),
– is maximized by µ = E_{θ_0} φ(X)

  • How can we relax this optimization problem?

– restrict M to a subset M̃ ⊂ M on which H_µ is easy to compute
– approximate H_µ

SLIDE 32

Examples, revisited

SLIDE 33

Mean field: a relaxation on M

  • Mean field as a relaxation on M
  • Recall: removing edges in a graphical model ⇒ adding independence constraints ⇒ smaller set of realizable mean parameters
  • Formally:

M̃ := { µ : µ_{s,t} = µ_s µ_t for all (s,t) ∈ E }

SLIDE 34

On M̃, the entropy decomposes

  • For µ ∈ M̃, we have (by Hammersley-Clifford):

−H_µ = E_{θ(µ)} log ∏_{s∈V} p_s(X_s; θ(µ)) = ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]

  • So the variational formulation

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

takes the form:

sup_µ ∑_{(s,t)∈E} θ_{s,t} µ_s µ_t − ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]
such that µ ∈ M̃ = { µ : µ_{s,t} = µ_s µ_t }

SLIDE 35

Easy to solve this optimization problem

  • It is easy to solve with successive coordinate maximization
  • Pick a coordinate µ_u and compute the gradient of the objective

∑_{(s,t)∈E} θ_{s,t} µ_s µ_t − ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]

with respect to µ_u:

d/dµ_u (…) = ∑_{v∈N(u)−{u}} θ_{u,v} µ_v − log µ_u − 1 + log(1 − µ_u) + 1
           = ∑_{v∈N(u)−{u}} θ_{u,v} µ_v − log( µ_u / (1 − µ_u) )

  • Set it to zero and get the naive mean field algorithm:

µ_u ← 1 / ( 1 + exp( − ∑_{v∈N(u)−{u}} θ_{u,v} µ_v ) )
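The sketch below (not from the slides) runs exactly this coordinate-ascent update on a toy {0,1} Ising graph; the graph, the edge parameters, the number of sweeps and the uniform initialization are all made-up illustration choices.

```python
# A sketch (not from the slides) of the naive mean field update just derived, on a
# toy {0,1} Ising graph. The graph, edge parameters, number of sweeps and the
# uniform initialization are all made-up illustration choices.
import math
import random

nodes = range(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
theta = {e: random.uniform(-1.0, 1.0) for e in edges}
neighbors = {u: [v for e in edges for v in e if u in e and v != u] for u in nodes}

def theta_uv(u, v):
    """Edge parameter looked up in either order."""
    return theta[(u, v)] if (u, v) in theta else theta[(v, u)]

mu = {u: 0.5 for u in nodes}                     # initialize every mean parameter to 1/2

for _ in range(100):                             # coordinate ascent sweeps
    for u in nodes:
        s = sum(theta_uv(u, v) * mu[v] for v in neighbors[u])
        mu[u] = 1.0 / (1.0 + math.exp(-s))       # mu_u <- sigmoid(sum_v theta_{u,v} mu_v)

print({u: round(m, 3) for u, m in mu.items()})
```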

SLIDE 36

Loopy BP: a relaxation on M and H_µ

  • Mean field used an inner approximation of M
  • Loopy BP uses:

– an outer approximation of M, and
– also an approximation of the entropy H_µ

  • Key idea: the Bethe approximation:

H_Bethe(µ) := ∑_{s∈V} H_s(µ_s) − ∑_{(s,t)∈E} I_{s,t}(µ_{s,t})

  • See “A Variational Principle for Graphical Models”, Wainwright and Jordan, for more details
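To show how the pieces of H_Bethe fit together, here is a sketch (not from the slides) for a single binary edge with made-up pseudo-marginals; on a one-edge (tree) graph the Bethe entropy coincides with the exact joint entropy.

```python
# A sketch (not from the slides) of how H_Bethe is assembled from node marginals and
# edge pseudo-marginals, for a single binary edge (s, t) with made-up numbers. On a
# one-edge (tree) graph the Bethe entropy coincides with the exact joint entropy.
import math

def H(p):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mutual_information(joint, p_s, p_t):
    """I(X_s; X_t) computed from a 2x2 table joint[x_s][x_t] and its two marginals."""
    return sum(joint[a][b] * math.log(joint[a][b] / (p_s[a] * p_t[b]))
               for a in (0, 1) for b in (0, 1) if joint[a][b] > 0)

mu_s, mu_t = [0.3, 0.7], [0.4, 0.6]              # node marginals
mu_st = [[0.2, 0.1], [0.2, 0.5]]                 # edge pseudo-marginal, consistent with them

H_bethe = H(mu_s) + H(mu_t) - mutual_information(mu_st, mu_s, mu_t)
print("Bethe entropy of this one-edge graph:", H_bethe)
```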

SLIDE 37

Extensions

  • Variational formulations suggest useful extensions
  • e.g., in the mean field approximation, instead of restricting M to completely factorized graphical models, use a spanning tree

SLIDE 38

Summary

  • It is useful to cast hard inference problems into optimization problems
  • How to cast?

– Fundamental idea from convex analysis: functions have a dual representation, as a locus of points and as a set of supporting tangents
– The convex conjugate function f* satisfies f** = f for convex f

  • Get: given θ_0, the following optimization problem:

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

– has optimal value A(θ_0),
– is maximized by µ = E_{θ_0} φ(X)

SLIDE 39

Summary

  • Then, relax the optimization problem. Two approaches:
  • 1. relaxation of the set of realizable mean parameters M
  • 2. approximation of H_µ
  • We have seen two specific examples:

– mean field, an instance of (1) (inner relaxation)
– loopy BP, a combination of (1) and (2) (outer relaxation, Bethe approximation)

  • Additional readings:

– “A Variational Principle for Graphical Models”, Wainwright and Jordan
– “Tutorial on variational approximation methods”, T. Jaakkola