Approximate inference on graphical models: variational methods
Alexandre Bouchard-Côté
Exact inference in general graphs...
Recall: we now have an exact, general and efficient inference algorithm: the Junction-Tree algorithm
Exact inference in general graphs is hard (the Junction-Tree algorithm is exponential in the treewidth)
We need approximate inference
Variational methods for approximate inference
– Cast the inference problem into a variational (optimization) problem
– Relax (simplify) the variational problem
Variational vs. sampling approaches
Sampling:
+ Converges to the true answer
+ Large toolbox and literature
− Mixing can be slow
− Assessing convergence is difficult

Variational:
+ Generally very fast
+ Deterministic algorithms
− Approximation can be poor
− Approximation can fail
Program
Examples will be on the good old Ising model, with binary variables x_s ∈ {0, 1} at the vertices of a graph (V, E) and pairwise interactions between neighbors:

P_θ(X = x) = exp( Σ_{(s,t)∈E} θ_{s,t} x_s x_t − A(θ) )
Importance and physical interpretation
– vertices represent spins of particles
– edges represent bonds
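As a concrete anchor for the later slides, the log-partition function A(θ) of a small Ising model can be computed by brute-force enumeration. A minimal sketch (the function name, edge set and couplings below are our own, purely illustrative choices):

```python
import itertools
import math

def log_partition(theta):
    """Brute-force A(theta) = log sum_x exp(sum_{(s,t) in E} theta[s,t] x_s x_t)
    for a small Ising model with x_s in {0, 1}. theta maps edges to couplings."""
    nodes = sorted({v for edge in theta for v in edge})
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(nodes)):
        val = dict(zip(nodes, x))
        total += math.exp(sum(th * val[s] * val[t] for (s, t), th in theta.items()))
    return math.log(total)

# 3-cycle with uniform couplings (illustrative)
theta = {(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.5}
A = log_partition(theta)  # equals log(4 + 3*exp(0.5) + exp(1.5))
```

This exponential-time enumeration is exactly what the variational methods below are designed to avoid.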
Example 1: Loopy Belief Propagation
M_{t→s}(x_s) ∝ Σ_{x_t} φ_{s,t}(x_s, x_t) φ_t(x_t) Π_{u∈N(t)−{s}} M_{u→t}(x_t)

where the product runs over the neighbors of t other than s. With this protocol (a node sends to s only after hearing from all its other neighbors), message passing makes sense only on trees.
The "loopy" fix: initialize all messages (say, uniformly); then, at every iteration, all nodes send messages using what they received from the previous iteration.
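A minimal sketch of this loopy protocol on the Ising model above (the function name and the parallel update schedule are our own choices; the unary potentials φ_t are taken to be 1 since the model has none):

```python
import math

def loopy_bp(theta, iters=100):
    """Parallel loopy BP for a pairwise Ising model with x_s in {0, 1} and
    phi_{s,t}(x_s, x_t) = exp(theta[s,t] * x_s * x_t); unary potentials are 1.
    theta maps edges (s, t) to couplings. Returns approximate node marginals."""
    nbrs, pot = {}, {}
    for (s, t), th in theta.items():
        nbrs.setdefault(s, set()).add(t)
        nbrs.setdefault(t, set()).add(s)
        pot[(s, t)] = pot[(t, s)] = th
    # messages M[t -> s](x_s), initialized uniformly
    msg = {(t, s): [0.5, 0.5] for t in nbrs for s in nbrs[t]}
    for _ in range(iters):
        new = {}
        for (t, s) in msg:
            out = []
            for xs in (0, 1):
                total = 0.0
                for xt in (0, 1):
                    prod = math.exp(pot[(s, t)] * xs * xt)
                    for u in nbrs[t] - {s}:  # all neighbors of t except s
                        prod *= msg[(u, t)][xt]
                    total += prod
                out.append(total)
            z = out[0] + out[1]
            new[(t, s)] = [out[0] / z, out[1] / z]
        msg = new
    marginals = {}
    for s in nbrs:
        b = [1.0, 1.0]
        for t in nbrs[s]:
            for xs in (0, 1):
                b[xs] *= msg[(t, s)][xs]
        z = b[0] + b[1]
        marginals[s] = [b[0] / z, b[1] / z]
    return marginals
```

On a tree (e.g., a chain) this reduces to ordinary BP and returns the exact marginals; on loopy graphs it is an approximation that may not converge.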
Example 2: Naive mean field
µ_u ← 1 / ( 1 + exp( − Σ_{s∈N(u)} θ_{s,u} µ_s ) )
The plan
– A(θ) is its maximum value
– µ(θ) is the maximizing argument
How to construct a variational formulation for A?
Convex Duality
Convex functions have two equivalent representations: as a locus of points, or as an envelope of supporting tangents. The convex conjugate makes this equivalence explicit:

f*(y) := sup_{x∈R^d} ( ⟨y, x⟩ − f(x) )
Geometric picture
Assume throughout that f is differentiable and strictly convex (this can be made more general!)
Connection with our problem
For convex f, the biconjugate recovers the function: f** := (f*)* = f. Since A is convex, this gives:

A(θ) = A**(θ) = sup_µ ( ⟨θ, µ⟩ − A*(µ) ),

a variational formulation.
f** = f (1)

Fix y0 and consider (taking d = 1 for simplicity):

f*(y0) := sup_{x∈R} ( y0 x − f(x) )

Setting the derivative to zero: d/dx ( y0 x − f(x) ) = 0 ⟹ y0 = f'(x).

By strict convexity, x → f'(x) is invertible, so the sup is attained at x = f'^{-1}(y0), and:

f*(y) = y f'^{-1}(y) − f( f'^{-1}(y) )
f** = f (2)

Similarly, f**(y0) := sup_x ( y0 x − f*(x) ),

using the expression f*(x) = x f'^{-1}(x) − f( f'^{-1}(x) ) we found in the previous slide. Its derivative:

f*'(x) = d/dx [ x f'^{-1}(x) − f( f'^{-1}(x) ) ]
       = f'^{-1}(x) + x (f'^{-1})'(x) − f'( f'^{-1}(x) ) (f'^{-1})'(x)
       = f'^{-1}(x),

since f'( f'^{-1}(x) ) = x.
f** = f (3)

So the derivative of the conjugate (in the context of our restricted assumptions) satisfies: f*'(x) = f'^{-1}(x).

Plugging this into the expression for the conjugate from slide (1), applied to f* in place of f, we get f** = f (check).
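The conjugacy relations above can be sanity-checked numerically on a grid. A sketch with f(x) = x², for which f*(y) = y²/4 (the helper name and grid bounds are our own, illustrative choices):

```python
import numpy as np

def conjugate(vals, xs, ys):
    """Grid approximation of the convex conjugate: f*(y) = sup_x (y*x - f(x))."""
    return np.array([np.max(y * xs - vals) for y in ys])

xs = np.linspace(-5, 5, 2001)
f = xs ** 2                              # strictly convex, differentiable
ys = np.linspace(-4, 4, 801)
f_star = conjugate(f, xs, ys)            # approximates y^2 / 4
f_star_star = conjugate(f_star, ys, xs)  # approximates x^2 (where the y-grid
                                         # contains the maximizer y = 2x)
```

The biconjugate matches f only where the inner grid covers the maximizer, a discrete echo of the caveat on the next slides.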
Caveat
The derivative A' is invertible only onto its image ℑ(A'): the conjugate A* will be finite only on realizable mean parameters.
Example: Bernoulli random variable

A*(µ) = µ A'^{-1}(µ) − A( A'^{-1}(µ) ), for µ ∈ ℑ(A'); +∞, otherwise.
Example: Bernoulli random variable

For a Bernoulli random variable in exponential-family form, A(θ) = log(1 + exp θ), so:

A'(θ) = exp θ / (1 + exp θ)

A'^{-1}(µ) = log( µ / (1 − µ) ), for µ ∈ ℑ(A') = (0, 1)

Plugging these into A*(µ) = µ A'^{-1}(µ) − A( A'^{-1}(µ) ):

A*(µ) = µ log( µ / (1 − µ) ) − log( 1 + µ / (1 − µ) )
      = µ log µ + (1 − µ) log(1 − µ)
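This closed form can be checked numerically by taking the sup defining A* over a grid of θ values (a sketch; the grid bounds and names are our own):

```python
import numpy as np

def A(theta):
    """Bernoulli log-partition function A(theta) = log(1 + exp(theta))."""
    return np.log1p(np.exp(theta))

thetas = np.linspace(-30.0, 30.0, 60001)

def A_star_numeric(mu):
    """Grid approximation of A*(mu) = sup_theta (theta*mu - A(theta))."""
    return np.max(thetas * mu - A(thetas))

mu = 0.3
closed_form = mu * np.log(mu) + (1 - mu) * np.log(1 - mu)  # negative entropy
```

The grid maximum agrees with µ log µ + (1 − µ) log(1 − µ) to high accuracy for µ in the interior of (0, 1).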
General expression for A∗
If p_µ denotes the distribution of the r.v. characterized by the moment parameters µ, then:

A*(µ) = −Hµ, if µ ∈ M; +∞, otherwise

where M is the set of realizable mean parameters and Hµ is the entropy of p_µ.
Negative entropy interpretation
A*(µ) = ⟨θ(µ), µ⟩ − A(θ(µ))
      = ⟨θ(µ), E_{θ(µ)} φ(X)⟩ − log Z(θ(µ))
      = E_{θ(µ)} log exp⟨θ(µ), φ(X)⟩ − E_{θ(µ)} log Z(θ(µ))
      = E_{θ(µ)} log [ exp⟨θ(µ), φ(X)⟩ / Z(θ(µ)) ]
      = E_{θ(µ)} log p_µ(X)
      = −Hµ
Finally, a variational formulation
sup_{µ∈M} ( ⟨θ0, µ⟩ − A*(µ) )

– has optimal value A(θ0)
– the value µmax achieving the sup satisfies θ0 = A*'(µmax)
– using A*'(x) = A'^{-1}(x), we get µmax = A'(θ0)
– hence µmax = E_{θ0} φ(X)
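For the Bernoulli case both claims can be verified directly: maximizing θ0 µ − A*(µ) over a fine grid of µ values recovers A(θ0) and the mean parameter (a sketch; θ0 = 1.3 is an arbitrary illustrative choice):

```python
import numpy as np

theta0 = 1.3
mus = np.linspace(1e-6, 1 - 1e-6, 200001)
neg_entropy = mus * np.log(mus) + (1 - mus) * np.log1p(-mus)  # A*(mu)
objective = theta0 * mus - neg_entropy
value = objective.max()           # should be A(theta0) = log(1 + exp(theta0))
mu_max = mus[objective.argmax()]  # should be A'(theta0) = sigmoid(theta0)
```

The optimal value matches log(1 + e^θ0) and the maximizer matches E_{θ0} X = σ(θ0), as the slide asserts.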
Finally, a variational formulation
sup_{µ∈M} ( ⟨θ0, µ⟩ − A*(µ) )

– has optimal value A(θ0)
– is maximized by µ = E_{θ0} φ(X)
– (1) restrict M to a subset M̃ ⊂ M on which Hµ is easy to compute
– (2) approximate Hµ
Mean field: a relaxation on M
Removing edges in a graphical model ⇒ adding independence constraints ⇒ a smaller set of realizable mean parameters. For the fully disconnected graph:

M̃ := { µ : µ_{s,t} = µ_s µ_t }
On M̃ , the entropy decomposes

On M̃ , we have (by Hammersley-Clifford):

−Hµ = E_{θ(µ)} log Π_s p_s(X_s; θ(µ)) = Σ_s [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]
On M̃ = { µ : µ_{s,t} = µ_s µ_t }, the variational problem sup_{µ∈M} ( ⟨θ, µ⟩ − A*(µ) ) takes the form:

sup_µ Σ_{(s,t)∈E} θ_{s,t} µ_s µ_t − Σ_s [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]
Easy to solve this optimization problem coordinate-wise: fix all coordinates but µ_u, and set the derivative to zero:

d/dµ_u ( Σ_{(s,t)∈E} θ_{s,t} µ_s µ_t − Σ_s [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ] )
  = Σ_{v∈N(u)} θ_{u,v} µ_v − ( log µ_u + 1 − log(1 − µ_u) − 1 )
  = Σ_{v∈N(u)} θ_{u,v} µ_v − log( µ_u / (1 − µ_u) ) = 0

Solving for µ_u gives the naive mean field update:

µ_u ← 1 / ( 1 + exp( − Σ_{v∈N(u)} θ_{u,v} µ_v ) )
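The update can be run as coordinate ascent. A minimal sketch (function and variable names are our own):

```python
import math

def naive_mean_field(theta, iters=200):
    """Coordinate-ascent naive mean field for a pairwise Ising model with
    x_s in {0, 1}: repeatedly set mu_u <- sigmoid(sum_{v in N(u)} theta[u,v]*mu_v).
    theta maps edges (s, t) to couplings."""
    nbrs = {}
    for (s, t), th in theta.items():
        nbrs.setdefault(s, {})[t] = th
        nbrs.setdefault(t, {})[s] = th
    mu = {u: 0.5 for u in nbrs}  # uniform initialization
    for _ in range(iters):
        for u in nbrs:
            s = sum(th * mu[v] for v, th in nbrs[u].items())
            mu[u] = 1.0 / (1.0 + math.exp(-s))
    return mu
```

Each sweep increases the mean-field objective, so the iterates converge to a (possibly local) optimum where every coordinate satisfies its own update equation.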
Loopy BP: a relaxation on M and Hµ
– an outer approximation on M, and
– an approximation on the entropy Hµ, the Bethe entropy:

H_Bethe(µ) := Σ_s H_s(µ_s) − Σ_{(s,t)∈E} I_{s,t}(µ_{s,t})

where H_s is the entropy of the marginal at node s and I_{s,t} the mutual information across edge (s, t).
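On a tree, the Bethe entropy is exact, which gives a quick sanity check of the formula (a sketch; the helper functions and the chain example are our own):

```python
import itertools
import math

def entropy(ps):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in ps if p > 0)

def bethe_entropy(node_marg, edge_marg):
    """H_Bethe = sum_s H_s(mu_s) - sum_{(s,t)} I_{s,t}(mu_{s,t}),
    writing the mutual information as I = H_s + H_t - H_{s,t}."""
    h = sum(entropy(p) for p in node_marg.values())
    for (s, t), joint in edge_marg.items():
        flat = [p for row in joint for p in row]
        h -= entropy(node_marg[s]) + entropy(node_marg[t]) - entropy(flat)
    return h

# chain 0 - 1 - 2: a tree, so H_Bethe should equal the exact entropy
theta = {(0, 1): 0.7, (1, 2): 0.7}
probs, Z = {}, 0.0
for x in itertools.product([0, 1], repeat=3):
    w = math.exp(sum(th * x[s] * x[t] for (s, t), th in theta.items()))
    probs[x] = w
    Z += w
probs = {x: w / Z for x, w in probs.items()}
node_marg = {s: [sum(p for x, p in probs.items() if x[s] == v) for v in (0, 1)]
             for s in range(3)}
edge_marg = {}
for (s, t) in theta:
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x, p in probs.items():
        m[x[s]][x[t]] += p
    edge_marg[(s, t)] = m
```

On loopy graphs the same formula is only an approximation, which is the second relaxation loopy BP makes.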
See the references at the end for more details.
Extensions
– Structured mean field: instead of a fully factorized graphical model, use a spanning tree
Summary
– Fundamental idea from convex analysis: functions have a dual representation, as a locus of points and as a set of supporting tangents
– The convex conjugate f* satisfies f** = f for convex f
sup_{µ∈M} ( ⟨θ0, µ⟩ − A*(µ) )

– has optimal value A(θ0)
– is maximized by µ = E_{θ0} φ(X)
Summary
– mean field: an instance of (1) (inner relaxation)
– loopy BP: a combination of (1) and (2) (outer relaxation, Bethe approximation)
– "A Variational Principle for Graphical Models", M. J. Wainwright and M. I. Jordan
– "Tutorial on variational approximation methods", T. Jaakkola