SLIDE 1

Approximate inference on graphical models: variational methods

Alexandre Bouchard-Côté

SLIDE 2

Exact inference in general graphs. . .

  • Recall: we now have an exact, general and “efficient” inference algorithm: the Junction-Tree algorithm

  • Why should you care about approximate inference?
SLIDE 3

Exact inference in general graphs is hard

  • Running time of JT is exponential in the max clique size of the JT
  • Don’t have to look very far to find graphs where JT is arbitrarily slow
SLIDE 4

We need approximate inference

  • A very hot topic in the Machine Learning community
  • Lots of important, open problems
  • Next lectures: Markov Chain Monte Carlo (MCMC) algorithms
  • Today: a completely different approach. . .
SLIDE 5

Variational methods for approximate inference

  • Framework:

– Cast the inference problem into a variational (optimization) problem
– Relax (simplify) the variational problem

SLIDE 6

Variational vs. sampling approaches

Sampling:
  + Converges to the true answer
  + Large toolbox and literature
  − Mixing can be slow
  − Assessing convergence

Variational:
  + Generally very fast
  + Deterministic algorithms
  − Approximation can be poor
  − Approximation can fail

SLIDE 7

Program

  • Specific examples of variational methods
  • Outline of the unifying theory
  • Examples revisited
SLIDE 8

Examples

SLIDE 9

Examples will be on the good old Ising model

  • An undirected graphical model structured as a lattice in R^d
  • Sufficient statistics φ(x_s, x_t) = x_s x_t, with x_u ∈ {−1, +1}, encourage agreement of neighbors:

P_θ(X = x) = exp( ∑_{(s,t)∈E} θ_{s,t} x_s x_t − A(θ) )

  • We will actually use x_u ∈ {0, 1} to slightly simplify the derivations
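To make the notation concrete, here is a minimal sketch (not from the slides) of this model in Python: the 3×3 grid size and the random edge parameters θ are made-up illustration values, and A(θ) is computed by brute-force enumeration, which is exactly what becomes infeasible on larger grids.

```python
# A minimal sketch (not from the slides) of this Ising model on a tiny grid,
# with x_u in {0, 1} as the slides use in the derivations. The 3x3 grid size
# and the random edge parameters theta are made-up illustration values.
import itertools
import math
import random

n = 3  # 3x3 grid; brute-force A(theta) below is only feasible for tiny grids
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = ([((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)]
         + [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)])
theta = {e: random.uniform(-1.0, 1.0) for e in edges}

def score(x):
    """sum_{(s,t) in E} theta_{s,t} x_s x_t, the exponent before subtracting A(theta)."""
    return sum(theta[s, t] * x[s] * x[t] for (s, t) in edges)

def log_partition():
    """A(theta) = log sum_x exp(score(x)), by enumerating all 2^9 configurations."""
    scores = [score(dict(zip(nodes, v)))
              for v in itertools.product([0, 1], repeat=len(nodes))]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

x = {u: random.choice([0, 1]) for u in nodes}
print("log P(X = x) =", score(x) - log_partition())
```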

SLIDE 10

Importance and physical interpretation

  • For grids in dimension > 2, encapsulates the full hardness of inference
  • Originates from statistical physics: model for a crystal structure

– vertices represent spins of particles
– edges represent bonds

  • Demonstration. . .
SLIDE 11

Example 1: Loopy Belief Propagation

  • Run max-product, even if you are not supposed to. . .

M_{t→s}(x_s) ∝ ∑_{x_t∈{0,1}} φ_{s,t}(x_s, x_t) φ_t(x_t) ∏_{u∈N(t)−{s}} M_{u→t}(x_t)

  • t sends a message to s when it has received the messages from all the other neighbors: with this protocol, it makes sense only on trees

SLIDE 12

Example 1: Loopy Belief Propagation

  • t sends a message to s when it has received the messages from all the other neighbors: with this protocol, it makes sense only on trees
  • On trees, the following protocol is equivalent: initialize the messages to one, then, at every iteration, all nodes send a message using what they received from the previous iteration

  • Makes sense on arbitrary graphs!
  • Does it work?
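As a rough illustration, here is a sketch (not from the slides) of this flooding schedule on a small loopy graph, written as sum-product with the Ising pairwise potentials and uniform node potentials; the graph, the edge parameters and the number of iterations are made-up illustration values.

```python
# A sketch (not from the slides) of loopy BP with the flooding schedule above,
# written as sum-product on a toy {0,1} Ising graph: phi_{s,t}(x_s, x_t) =
# exp(theta_{s,t} x_s x_t), uniform node potentials phi_t. The graph, the edge
# parameters and the number of iterations are made-up illustration values.
import math
import random

nodes = range(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]          # a small graph with loops
theta = {e: random.uniform(-1.0, 1.0) for e in edges}
neighbors = {u: [v for e in edges for v in e if u in e and v != u] for u in nodes}

def phi(s, t, xs, xt):
    """Pairwise potential exp(theta_{s,t} * x_s * x_t), looked up in either edge order."""
    w = theta[(s, t)] if (s, t) in theta else theta[(t, s)]
    return math.exp(w * xs * xt)

# messages[(t, s)][x_s]: message from t to s, initialized to one
messages = {(t, s): [1.0, 1.0] for t in nodes for s in neighbors[t]}

for _ in range(50):                        # flooding schedule: everyone sends each round
    new = {}
    for (t, s) in messages:
        m = [sum(phi(s, t, xs, xt)
                 * math.prod(messages[(u, t)][xt] for u in neighbors[t] if u != s)
                 for xt in (0, 1))
             for xs in (0, 1)]
        z = sum(m)                         # normalize to keep the numbers well scaled
        new[(t, s)] = [mi / z for mi in m]
    messages = new

for s in nodes:                            # beliefs: products of incoming messages
    b = [math.prod(messages[(u, s)][xs] for u in neighbors[s]) for xs in (0, 1)]
    print(f"node {s}: P(x_s = 1) is approximately {b[1] / sum(b):.3f}")
```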
SLIDE 13

Example 2: Naive mean field

  • A simpler coordinate ascent algorithm

µ_u ← 1 / ( 1 + exp( −2 ∑_{s∈N(u)−{u}} θ_{s,u} µ_s ) )

  • Our goal is to make sense out of these algorithms
SLIDE 14

Unifying theory

SLIDE 15

The plan

  • Focus on computing A(θ) and µ(θ) = Eθφ(X)
  • Construct an optimization problem s.t.

– A(θ) is its maximum value
– µ(θ) is the maximizing argument

  • Relax/simplify this optimization problem
SLIDE 16

How to construct a variational formulation for A?

  • Key concept: convex duality (recall A is convex. . . )
  • Two equivalent ways to specify convex functions
SLIDE 17

Convex Duality

  • The convex conjugate of f : R^d → R ∪ {+∞}, denoted f*, makes this equivalence explicit:

f*(y) := sup_{x∈R^d} ( ⟨y, x⟩ − f(x) )

  • Set f*(y) = +∞ for unbounded values, so that f* : R^d → R ∪ {+∞}.
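A quick way to get a feel for this definition is to approximate the sup numerically. The sketch below (not from the slides) does a grid search for the made-up test function f(x) = x²/2, whose conjugate is known in closed form to be f*(y) = y²/2.

```python
# A small numerical illustration (not from the slides) of the conjugate
# f*(y) = sup_x (<y, x> - f(x)), approximating the sup by a grid search over x.
# The test function f(x) = x^2 / 2 is a made-up choice; its conjugate is known
# in closed form to be f*(y) = y^2 / 2, which the grid search should recover.
import numpy as np

def conjugate(f, y, xs=np.linspace(-50.0, 50.0, 200001)):
    """Approximate f*(y) = sup_x (x * y - f(x)) over a fixed grid of x values."""
    return float(np.max(xs * y - f(xs)))

f = lambda x: 0.5 * x ** 2
for y in [-2.0, 0.0, 1.5, 3.0]:
    print(y, conjugate(f, y), 0.5 * y ** 2)   # the last two columns should agree closely
```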
SLIDE 18

Geometric picture

  • Warning: for pedagogical reasons, assume for now that f is univariate, twice differentiable and strictly convex (can be made more general!!)

  • “f acts on points, f ∗ acts on tangents”
SLIDE 19

Connection with our problem

  • We will show that for convex f:

f** := (f*)* = f

  • Using this with f = A and expanding the definition of convex conjugacy:

A(θ) = A**(θ) = sup_x ( ⟨θ, x⟩ − A*(x) ),

a variational formulation.

SLIDE 20

f** = f (1)

  • First, let us fix some y_0 ∈ R and find a “closed form” for f*(y_0):

f*(y_0) := sup_{x∈R^d} ( x y_0 − f(x) )

  • Use differentiability and convexity to apply the derivative test:

d/dx ( x y_0 − f(x) ) = y_0 − f′(x) = 0  ⇒  f′(x_max) = y_0

  • Observation: f strictly convex ⇒ f′ strictly increasing ⇒ the function x ↦ f′(x) is invertible

SLIDE 21

f** = f (1)

  • First, let us fix some y_0 ∈ R and find a “closed form” for f*(y_0):

f*(y_0) := sup_{x∈R^d} ( x y_0 − f(x) )

  • Use differentiability and convexity to apply the derivative test:

d/dx ( x y_0 − f(x) ) = 0  ⇒  f′(x_max) = y_0  ⇒  x_max = f′⁻¹(y_0)

  • Plug this into x y_0 − f(x) to get f*:

f*(y) = y x_max − f(x_max) = y f′⁻¹(y) − f(f′⁻¹(y))
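As a concrete worked example (not from the slides), take f(x) = eˣ, which is strictly convex and twice differentiable:

```latex
% A worked example (not from the slides) under the same assumptions,
% for the made-up choice f(x) = e^x:
f'(x) = e^x, \qquad
x_{\max} = f'^{-1}(y) = \log y \quad \text{for } y \in \Im(f') = (0, \infty), \qquad
f^*(y) = y \, f'^{-1}(y) - f\bigl(f'^{-1}(y)\bigr) = y \log y - y.
```

For y outside ℑ(f′) the sup is not attained (it is unbounded for y < 0), which is exactly the caveat addressed a couple of slides ahead.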
SLIDE 22

f** = f (2)

  • We can repeat the same process on

f**(y_0) := sup_{x∈R^d} ( x y_0 − f*(x) ),

using the expression f*(x) = x f′⁻¹(x) − f(f′⁻¹(x)) we found on the previous slide.

  • f*′ looks bad at first but things cancel out:

f*′(x) = d/dx [ x f′⁻¹(x) − f(f′⁻¹(x)) ]
       = f′⁻¹(x) + x (f′⁻¹)′(x) − f′(f′⁻¹(x)) · (f′⁻¹)′(x)
       = f′⁻¹(x)      (since f′(f′⁻¹(x)) = x, the last two terms cancel)

SLIDE 23

f** = f (3)

  • This actually yields an alternate, implicit characterization of the convex conjugate (in the context of our restricted assumptions):

f*′(x) = f′⁻¹(x)

  • We can already see from this equation, applied twice, that f** = f up to a constant
  • Applying the derivative test as in part (1) of the derivation and using this result, we get f** = f (check).

SLIDE 24

Caveat

  • There was a problem with this derivation:
  • f′(f′⁻¹(x)) = x is only defined for x in the image of f′, ℑ(f′)
  • Set f*(x) = +∞ otherwise
  • In the inference setting (f = A), this will correspond to constraints on the realizable mean parameters

SLIDE 25

Example: Bernoulli random variable

  • P(X = x) ∝ exp(θx) for x ∈ {0, 1}, θ ∈ R
  • A(θ) = log(1 + exp(θ))
  • Let’s compute A* using the formula we derived:

A*(µ) = µ A′⁻¹(µ) − A(A′⁻¹(µ))   for µ ∈ ℑ(A′),   +∞ otherwise

  • By the way, recall: E_θ φ(X) = A′(θ); here φ(x) = x, that explains the notation
SLIDE 26

Example: Bernoulli random variable

  • A(θ) = log(1 + exp(θ))
  • A′(θ) = exp(θ) / (1 + exp(θ))
  • A′⁻¹(µ) = log( µ / (1 − µ) ),   for µ ∈ ℑ(A′) = (0, 1)

SLIDE 27

Plug these into A*(µ) = µ A′⁻¹(µ) − A(A′⁻¹(µ))

  • A(θ) = log(1 + exp(θ))
  • A′⁻¹(µ) = log( µ / (1 − µ) ),   for µ ∈ ℑ(A′) = (0, 1)
  • Get, for µ ∈ ℑ(A′) = (0, 1):

A*(µ) = µ A′⁻¹(µ) − A(A′⁻¹(µ))
      = µ log( µ / (1 − µ) ) − log( 1 + exp( log( µ / (1 − µ) ) ) )
      = µ log µ + (1 − µ) log(1 − µ)
  • Does that look familiar?
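As a quick numerical sanity check (not in the slides), the sketch below approximates A*(µ) = sup_θ (θµ − A(θ)) by a grid search over θ and compares it to the closed form just derived.

```python
# A quick numerical sanity check (not from the slides): for the Bernoulli family,
# A*(mu) = sup_theta (theta * mu - A(theta)) should match mu log mu + (1-mu) log(1-mu).
import numpy as np

A = lambda theta: np.log1p(np.exp(theta))        # A(theta) = log(1 + exp(theta))
thetas = np.linspace(-30.0, 30.0, 600001)        # grid standing in for the sup over theta

for mu in [0.1, 0.5, 0.9]:
    numeric = float(np.max(thetas * mu - A(thetas)))
    closed_form = mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
    print(mu, numeric, closed_form)              # the two values should agree closely
```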
SLIDE 28

General expression for A*

  • This is the negative entropy!
  • This actually holds in general: let H_µ := −E_µ log p_µ(X) be the entropy of the r.v. characterized by the moment parameters µ; then

A*(µ) = −H_µ   if µ ∈ M,   +∞ otherwise

  • Here, M := ℑ(A′) is the set of realizable mean parameters
SLIDE 29

Negative entropy interpretation

  • General derivation: for µ ∈ M:

A*(µ) = ⟨θ(µ), µ⟩ − A(θ(µ))
      = ⟨θ(µ), E_{θ(µ)} φ(X)⟩ − log Z(θ(µ))
      = E_{θ(µ)} log exp⟨θ(µ), φ(X)⟩ − E_{θ(µ)} log Z(θ(µ))
      = E_{θ(µ)} log [ exp⟨θ(µ), φ(X)⟩ / Z(θ(µ)) ]
      = E_{θ(µ)} log p_µ(X)
      = −H_µ

SLIDE 30

Finally, a variational formulation

  • Given θ_0, the following optimization problem:

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

– has optimal value A(θ_0)

  • Moreover, it is maximized by µ = E_{θ_0} φ(X):

– the value µ_max achieving the sup is s.t. θ_0 = A*′(µ_max)
– using A*′(x) = A′⁻¹(x), get µ_max = A′(θ_0)
– hence µ_max = E_{θ_0} φ(X)
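For intuition, here is a small numerical check (not from the slides) of this formulation in the Bernoulli case, with a made-up θ_0: the grid search should recover both A(θ_0) and the maximizing mean parameter.

```python
# A numerical check (not from the slides) of the variational formulation in the
# Bernoulli case: sup_mu (theta0 * mu + H(mu)) should equal A(theta0) = log(1 + e^theta0)
# and should be attained at mu = sigmoid(theta0). theta0 = 1.3 is a made-up value.
import numpy as np

H = lambda mu: -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))   # entropy for mu in (0, 1)
mus = np.linspace(1e-6, 1 - 1e-6, 2000001)                      # grid over M = (0, 1)

theta0 = 1.3
objective = theta0 * mus + H(mus)
print("sup value:", float(objective.max()), " A(theta0):", float(np.log1p(np.exp(theta0))))
print("maximizer:", float(mus[objective.argmax()]), " sigmoid(theta0):", 1 / (1 + np.exp(-theta0)))
```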

SLIDE 31

Finally, a variational formulation

  • Given θ_0, the following optimization problem:

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

– has optimal value A(θ_0),
– is maximized by µ = E_{θ_0} φ(X)

  • How can we relax this optimization problem?

– restrict M to a subset M̃ ⊂ M on which H_µ is easy to compute
– approximate H_µ

SLIDE 32

Examples, revisited

SLIDE 33

Mean field: a relaxation on M

  • Mean field as a relaxation on M
  • Recall: removing edges in a graphical model ⇒ adding independence constraints ⇒ smaller set of realizable mean parameters
  • Formally:

M̃ := { µ : µ_{s,t} = µ_s µ_t for all (s,t) ∈ E }

SLIDE 34

On M̃, the entropy decomposes

  • For µ ∈ M̃, we have (by Hammersley-Clifford):

−H_µ = E_{θ(µ)} log ∏_{s∈V} p_s(X_s; θ(µ)) = ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]

  • So the variational formulation

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

takes the form:

sup_µ ∑_{(s,t)∈E} θ_{s,t} µ_s µ_t − ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]
such that µ ∈ M̃ = { µ : µ_{s,t} = µ_s µ_t }

SLIDE 35

Easy to solve this optimization problem

  • It is easy to solve with successive coordinate maximization
  • Pick a coordinate µ_u and compute the gradient of the objective

∑_{(s,t)∈E} θ_{s,t} µ_s µ_t − ∑_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]

with respect to µ_u:

d/dµ_u (…) = ∑_{v∈N(u)−{u}} θ_{u,v} µ_v − log µ_u − 1 + log(1 − µ_u) + 1
           = ∑_{v∈N(u)−{u}} θ_{u,v} µ_v − log( µ_u / (1 − µ_u) )

  • Set it to zero and get the naive mean field algorithm:

µ_u ← 1 / ( 1 + exp( − ∑_{v∈N(u)−{u}} θ_{u,v} µ_v ) )
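The sketch below (not from the slides) runs exactly this coordinate-ascent update on a toy {0,1} Ising graph; the graph, the edge parameters, the number of sweeps and the uniform initialization are all made-up illustration choices.

```python
# A sketch (not from the slides) of the naive mean field update just derived, on a
# toy {0,1} Ising graph. The graph, edge parameters, number of sweeps and the
# uniform initialization are all made-up illustration choices.
import math
import random

nodes = range(4)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
theta = {e: random.uniform(-1.0, 1.0) for e in edges}
neighbors = {u: [v for e in edges for v in e if u in e and v != u] for u in nodes}

def theta_uv(u, v):
    """Edge parameter looked up in either order."""
    return theta[(u, v)] if (u, v) in theta else theta[(v, u)]

mu = {u: 0.5 for u in nodes}                     # initialize every mean parameter to 1/2

for _ in range(100):                             # coordinate ascent sweeps
    for u in nodes:
        s = sum(theta_uv(u, v) * mu[v] for v in neighbors[u])
        mu[u] = 1.0 / (1.0 + math.exp(-s))       # mu_u <- sigmoid(sum_v theta_{u,v} mu_v)

print({u: round(m, 3) for u, m in mu.items()})
```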

SLIDE 36

Loopy BP: a relaxation on M and H_µ

  • Mean field used an inner approximation of M
  • Loopy BP uses:

– an outer approximation of M, and
– also an approximation of the entropy H_µ

  • Key idea: the Bethe approximation:

H_Bethe(µ) := ∑_{s∈V} H_s(µ_s) − ∑_{(s,t)∈E} I_{s,t}(µ_{s,t})

  • See “A Variational Principle for Graphical Models”, Wainwright and Jordan, for more details
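To show how the pieces of H_Bethe fit together, here is a sketch (not from the slides) for a single binary edge with made-up pseudo-marginals; on a one-edge (tree) graph the Bethe entropy coincides with the exact joint entropy.

```python
# A sketch (not from the slides) of how H_Bethe is assembled from node marginals and
# edge pseudo-marginals, for a single binary edge (s, t) with made-up numbers. On a
# one-edge (tree) graph the Bethe entropy coincides with the exact joint entropy.
import math

def H(p):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mutual_information(joint, p_s, p_t):
    """I(X_s; X_t) computed from a 2x2 table joint[x_s][x_t] and its two marginals."""
    return sum(joint[a][b] * math.log(joint[a][b] / (p_s[a] * p_t[b]))
               for a in (0, 1) for b in (0, 1) if joint[a][b] > 0)

mu_s, mu_t = [0.3, 0.7], [0.4, 0.6]              # node marginals
mu_st = [[0.2, 0.1], [0.2, 0.5]]                 # edge pseudo-marginal, consistent with them

H_bethe = H(mu_s) + H(mu_t) - mutual_information(mu_st, mu_s, mu_t)
print("Bethe entropy of this one-edge graph:", H_bethe)
```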

SLIDE 37

Extensions

  • Variational formulations suggest useful extensions
  • e.g., in the mean field approximation, instead of restricting M to completely factorized graphical models, use a spanning tree

SLIDE 38

Summary

  • It is useful to cast hard inference problems into optimization problems
  • How to cast?

– Fundamental idea from convex analysis: functions have a dual representation, as a locus of points and as a set of supporting tangents
– The convex conjugate function f* satisfies f** = f for convex f

  • Get: given θ_0, the following optimization problem:

sup_µ ( ⟨θ_0, µ⟩ + H_µ )   such that µ ∈ M,

– has optimal value A(θ_0),
– is maximized by µ = E_{θ_0} φ(X)

SLIDE 39

Summary

  • Then, relax the optimization problem. Two approaches:
  • 1. relaxation of the set of realizable mean parameters M
  • 2. approximation of H_µ
  • We have seen two specific examples:

– mean field, an instance of (1) (inner relaxation)
– loopy BP, a combination of (1) and (2) (outer relaxation, Bethe approximation)

  • Additional readings:

– “A Variational Principle for Graphical Models”, Wainwright and Jordan
– “Tutorial on variational approximation methods”, T. Jaakkola