SLIDE 1

Variational Mean Field for Graphical Models

CS/CNS/EE 155

Baback Moghaddam

Machine Learning Group

baback@jpl.nasa.gov

SLIDE 2

Approximate Inference

  • Consider general UGs (i.e., not tree-structured)
  • All basic computations are intractable (for large G)
  • likelihoods & partition function
  • marginals & conditionals
  • finding modes
SLIDE 3

Taxonomy of Inference Methods

  • Exact : VE, JT, BP
  • Approximate
  • Stochastic (MC) : Gibbs, M-H, MC, SA
  • Deterministic
  • Cluster (~MP) : LBP, EP
  • Variational

SLIDE 4

Approximate Inference

  • Stochastic (Sampling)
  • Metropolis-Hastings, Gibbs, (Markov Chain) Monte Carlo, etc.
  • computationally expensive, but "exact" (in the limit)
  • Deterministic (Optimization)
  • Mean Field (MF), Loopy Belief Propagation (LBP)
  • Variational Bayes (VB), Expectation Propagation (EP)
  • computationally cheaper, but not exact (gives bounds)
SLIDE 5

Mean Field : Overview

  • General idea
  • approximate p(x) by a simpler factored distribution q(x)
  • minimize "distance" D(p||q)
  • e.g., Kullback-Leibler divergence
  • original G → (naïve) MF H0 → structured MF Hs
  • p(x) ∝ ∏_c φ_c(x_c)
  • naïve MF : q(x) = ∏_i q_i(x_i)
  • structured MF : q(x) ∝ q_A(x_A) q_B(x_B)
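The general idea can be made concrete with a short sketch in Python (the 3-variable model, its coupling strength, and all names are illustrative choices, not from the slides): build the fully factored q(x) from p's single-site marginals and confirm that the KL "distance" to p is strictly positive whenever p carries coupling that q cannot represent.

```python
import itertools
import math

# Toy 3-variable binary distribution p(x) with Ising-like pairwise coupling.
# We form the fully factored q(x) = prod_i q_i(x_i) from p's single-site
# marginals and measure the resulting KL divergence D(q||p).

def normalize(scores):
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

# unnormalized p: neighboring sites prefer to agree
p = normalize({x: math.exp(sum(0.8 * (1 if x[i] == x[i + 1] else -1)
                               for i in range(2)))
               for x in itertools.product([0, 1], repeat=3)})

# single-site marginals of p
marg = [sum(px for x, px in p.items() if x[i] == 1) for i in range(3)]

# fully factored q built from those marginals
q = {x: math.prod(marg[i] if x[i] == 1 else 1 - marg[i] for i in range(3))
     for x in itertools.product([0, 1], repeat=3)}

kl_qp = sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)
print(round(kl_qp, 4))  # strictly positive: q cannot capture the coupling
```

By symmetry the marginals here are all 0.5, so q is uniform while p is not; the positive KL is exactly the approximation error mean field trades for tractability.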

SLIDE 6

Mean Field : Overview

  • Naïve MF has roots in Statistical Mechanics (1890s)
  • physics of spin glasses (Ising), ferromagnetism, etc.
  • why is it called "Mean Field"? with full factorization: E[xi xj] = E[xi] E[xj]
  • Structured MF is more "modern"

Coupled HMM → Structured MF approximation (with tractable chains)

SLIDE 7

KL Projection D(Q||P)

  • Infer hidden h given visible v (clamp v nodes with δ's)
  • Variational : optimize KL globally
  • the right density form for Q "falls out"
  • this KL is easier since we're taking E[.] wrt the simpler Q
  • Q seeks the mode with the largest mass (not height), so it will tend to underestimate the support of P
  • P = 0 forces Q = 0

SLIDE 8

KL Projection D(P||Q)

  • Infer hidden h given visible v (clamp v nodes with δ's)
  • Expectation Propagation (EP) : optimize KL locally
  • this KL is harder since we're taking E[.] wrt P
  • no nice global solution for Q "falls out"
  • must sequentially tweak each qc (match moments)
  • Q covers all modes, so it overestimates the support of P
  • P > 0 forces Q > 0

SLIDE 9

α-divergences

  • The 2 basic KL divergences are special cases of the α-divergence Dα(p||q)
  • Dα(p||q) is non-negative, and 0 iff p = q
  • when α → -1 we get KL(P||Q)
  • when α → +1 we get KL(Q||P)
  • when α = 0, D0(P||Q) is proportional to Hellinger's distance (a metric)
  • So many variational approximations must exist, one for each α!
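Amari's α-divergence for discrete distributions makes the quoted special cases easy to check numerically. The sketch below uses the standard form Dα(p||q) = 4/(1−α²) [1 − Σ p^((1−α)/2) q^((1+α)/2)], with the endpoint values taken as limits; the distributions are arbitrary examples.

```python
import math

# Amari alpha-divergence for discrete distributions, checking the slide's
# special cases: alpha -> -1 gives KL(P||Q), alpha -> +1 gives KL(Q||P),
# and alpha = 0 is proportional to the squared Hellinger distance.

def d_alpha(p, q, a):
    assert abs(a) < 1  # the endpoints are defined as limits
    s = sum(pi ** ((1 - a) / 2) * qi ** ((1 + a) / 2) for pi, qi in zip(p, q))
    return 4.0 / (1 - a * a) * (1 - s)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(round(d_alpha(p, q, -0.9999), 4), round(kl(p, q), 4))  # nearly equal
print(round(d_alpha(p, q, +0.9999), 4), round(kl(q, p), 4))  # nearly equal
hell2 = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
print(round(d_alpha(p, q, 0.0), 4), round(2 * hell2, 4))     # exactly equal
```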

SLIDE 10

for more on α-divergences, see Shun-ichi Amari

SLIDE 11

for specific examples of α = ±1, see Chapter 10:

  • Variational Single Gaussian
  • Variational Linear Regression
  • Variational Mixture of Gaussians
  • Variational Logistic Regression
  • Expectation Propagation (α = -1)

SLIDE 12

Hierarchy of Algorithms

(based on α and structuring)

  • BP : fully factorized, KL(p||q)
  • EP : exp family, KL(p||q)
  • FBP : fully factorized, Dα(p||q)
  • Power EP : exp family, Dα(p||q)
  • MF : fully factorized, KL(q||p)
  • Structured MF : exp family, KL(q||p)
  • TRW : fully factorized, Dα(p||q) with α > 1

by Tom Minka

SLIDE 13

Variational MF

p(x) = (1/Z) ∏_c γ_c(x_c) = (1/Z) e^ψ(x),   where ψ(x) = Σ_c log γ_c(x_c)

Z = ∫ e^ψ(x) dx
  = ∫ Q(x) [ e^ψ(x) / Q(x) ] dx
  = E_Q[ e^ψ(x) / Q(x) ]

log Z = log E_Q[ e^ψ(x) / Q(x) ]
      ≥ sup_Q E_Q[ log( e^ψ(x) / Q(x) ) ]     (Jensen's)
      = sup_Q { E_Q[ψ(x)] + H[Q(x)] }

SLIDE 14

Variational MF

log Z ≥ sup_Q { E_Q[ψ(x)] + H[Q(x)] }

  • Equality is obtained for Q(x) = P(x) (all Q admissible)
  • Using any other Q yields a lower bound on log Z
  • The slack in this bound is the KL-divergence D(Q||P)
  • Goal: restrict Q to a tractable subclass Q, then optimize with sup_Q to tighten this bound
  • note we're (also) maximizing entropy H[Q]
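These three facts (lower bound for any Q, slack equal to D(Q||P), equality at Q = P) can be verified numerically on a tiny model; the random ψ over three binary variables below is purely illustrative.

```python
import itertools
import math
import random

# Check: log Z >= E_Q[psi(x)] + H[Q(x)], with slack KL(Q||P), on a small model.

random.seed(0)
xs = list(itertools.product([0, 1], repeat=3))
psi = {x: random.uniform(-1, 1) for x in xs}   # psi(x) = log of total potential
z = sum(math.exp(psi[x]) for x in xs)
logz = math.log(z)
p = {x: math.exp(psi[x]) / z for x in xs}

def bound(q):
    e_psi = sum(q[x] * psi[x] for x in xs)
    entropy = -sum(q[x] * math.log(q[x]) for x in xs if q[x] > 0)
    return e_psi + entropy

q_unif = {x: 1.0 / len(xs) for x in xs}        # an arbitrary admissible Q
b_unif = bound(q_unif)
kl_qp = sum(q_unif[x] * math.log(q_unif[x] / p[x]) for x in xs)

print(logz >= b_unif)                     # the bound holds for this Q
print(abs(logz - b_unif - kl_qp) < 1e-9)  # slack is exactly KL(Q||P)
print(abs(logz - bound(p)) < 1e-9)        # equality at Q = P
```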

SLIDE 15

Variational MF

log Z ≥ sup_Q { E_Q[ψ(x)] + H[Q(x)] }

Most common specialized family : "log-linear models"

  • linear in parameters θ (natural parameters of EFs)
  • clique potentials φ(x) (sufficient statistics of EFs)

ψ(x) = Σ_c θ_c φ_c(x_c) = θᵀ φ(x)

Fertile ground for plowing : Convex Analysis

SLIDE 16

Convex Analysis : "The Old Testament" and "The New Testament"

SLIDE 17

Variational MF for EF

log Z ≥ sup_Q { E_Q[ψ(x)] + H[Q(x)] }
log Z ≥ sup_Q { E_Q[θᵀφ(x)] + H[Q(x)] }     (EF notation)
log Z ≥ sup_Q { θᵀ E_Q[φ(x)] + H[Q(x)] }
A(θ)  ≥ sup_{µ ∈ M} { θᵀµ − A*(µ) }

M = set of all moment parameters realizable under subclass Q

SLIDE 18

Variational MF for EF

So it looks like we are just optimizing a concave function (a linear term minus the negative-entropy A*(µ)) over a convex set. Yet it is hard... Why?

  • 1. graph probability (being a measure) requires a very large number of marginalization constraints for consistency (leading to a typically beastly marginal polytope M in the discrete case)
  • e.g., a complete 7-node graph's polytope has over 10^8 facets! In fact, optimizing just the linear term alone can be hard
  • 2. exact computation of the entropy −A*(µ) is highly non-trivial (hence the famed Bethe & Kikuchi approximations)

SLIDE 19

Gibbs Sampling for Ising

  • Binary MRF G = (V,E) with pairwise clique potentials
  • 1. pick a node s at random
  • 2. sample u ~ Uniform(0,1)
  • 3. update node s
  • 4. goto step 1
  • a slower stochastic version of ICM
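A minimal sketch of this sampler (the 4-node cycle, the ±1 spin convention, and the parameter values are illustrative choices; the slide's own update formula was lost in extraction, so the conditional below is the standard Ising one):

```python
import math
import random

# Gibbs sampling for a binary (Ising) MRF with pairwise potentials,
# following the slide's recipe: pick a node, draw u ~ Uniform(0,1), update.

random.seed(1)
n = 4                                # 4-node cycle
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_s = [0.2] * n                  # singleton potentials
theta_st = 0.5                       # uniform coupling (agreement favored)

nbrs = {s: [] for s in range(n)}
for s, t in edges:
    nbrs[s].append(t)
    nbrs[t].append(s)

x = [random.choice([-1, 1]) for _ in range(n)]
counts = [0.0] * n
sweeps = 20000
for it in range(sweeps * n):
    s = random.randrange(n)                          # 1. pick a node at random
    field = theta_s[s] + theta_st * sum(x[t] for t in nbrs[s])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))    # P(x_s = +1 | neighbors)
    u = random.random()                              # 2. u ~ Uniform(0,1)
    x[s] = 1 if u < p_plus else -1                   # 3. update node s
    for t in range(n):                               # accumulate E[x_t]
        counts[t] += x[t]

means = [c / (sweeps * n) for c in counts]
print([round(m, 2) for m in means])  # positive field + coupling push spins up
```

As the slide says, this is "exact" only in the limit: the estimates converge slowly, one stochastic site-update at a time.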

SLIDE 20

Naive MF for Ising

  • use a variational mean parameter at each site
  • 1. pick a node s at random
  • 2. update its parameter
  • 3. goto step 1
  • deterministic "loopy" message-passing
  • how well does it work? depends on θ
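The deterministic counterpart can be sketched the same way, assuming the standard naive-MF fixed-point update µ_s = tanh(θ_s + Σ_t θ_st µ_t) for ±1 spins (the slide's own formula was lost; the 4-node cycle and parameters are illustrative choices).

```python
import math

# Naive mean-field coordinate ascent for a small Ising model: each site keeps
# a variational mean parameter mu_s, updated from the "mean field" of its
# neighbors until the updates reach a fixed point.

n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_s = [0.2] * n
theta_st = 0.5

nbrs = {s: [] for s in range(n)}
for s, t in edges:
    nbrs[s].append(t)
    nbrs[t].append(s)

mu = [0.0] * n                       # mu_s = E_q[x_s], spins in {-1, +1}
for sweep in range(100):             # deterministic "message passing"
    for s in range(n):               # 1. pick a node  2. update its parameter
        mu[s] = math.tanh(theta_s[s] + theta_st * sum(mu[t] for t in nbrs[s]))

print([round(m, 3) for m in mu])     # all sites reach the same fixed point
```

Unlike the Gibbs version this converges in a handful of sweeps, but it returns a (possibly biased) fixed point rather than asymptotically exact samples.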
SLIDE 21

Graphical Models as EF

  • G(V,E) with nodes
  • sufficient stats : clique potentials (likewise for θ_st)
  • probability
  • log-partition
  • mean parameters
SLIDE 22

Variational Theorem for EF

  • For any mean parameter µ in the relative interior of M, where θ(µ) is the corresponding natural parameter,
  • the log-partition function has this variational representation: A(θ) = sup_µ { θᵀµ − A*(µ) }
  • this supremum is achieved at the moment-matching value of µ
  • (for µ not in the closure of M, A*(µ) = +∞)

SLIDE 23

Legendre-Fenchel Duality

  • Main Idea: (convex) functions can be "supported" (lower-bounded) by a continuum of lines (hyperplanes) whose intercepts create a conjugate dual of the original function (and vice versa)
  • conjugate dual of A : A*(µ) = sup_θ { θᵀµ − A(θ) }
  • conjugate dual of A* : A**(θ) = sup_µ { θᵀµ − A*(µ) }
  • Note that A** = A (iff A is convex)

SLIDE 24

Dual Map for EF

  • Two equivalent parameterizations of the EF
  • Bijective mapping between Ω and the interior of M
  • Mapping is defined by the gradients of A and its dual A*
  • Shape & complexity of M depends on X and the size and structure of G

SLIDE 25

Marginal Polytope

  • G(V,E) = graph with discrete nodes
  • Then M = convex hull of all φ(x)
  • equivalent to intersecting half-spaces aᵀµ ≥ b
  • difficult to characterize for large G
  • hence difficult to optimize over
  • interior of M is 1-to-1 with Ω

SLIDE 26

The Simplest Graph : a single Bernoulli node x

  • G(V,E) = a single Bernoulli node, φ(x) = x
  • density : p(x) = exp( θx − A(θ) )
  • log-partition : A(θ) = log(1 + e^θ)  (of course we knew this)
  • we know A* too, but let's solve for it variationally : A*(µ) = sup_θ { θµ − A(θ) }
  • differentiate : µ = e^θ / (1 + e^θ) at the stationary point
  • rearrange to θ = log( µ / (1−µ) ), substitute into A* : A*(µ) = µ log µ + (1−µ) log(1−µ)
  • Note: we found both the mean parameter and the lower bound using the variational method
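This single-node derivation can be checked end to end by brute force; the grid search over θ below stands in for the sup, and everything else follows the slide's Bernoulli setup.

```python
import math

# Recover the conjugate dual A*(mu) = sup_theta { theta*mu - A(theta) }
# numerically and compare with the closed form mu*log(mu) + (1-mu)*log(1-mu)
# (the negative entropy of a Bernoulli with mean mu).

def A(theta):
    return math.log(1.0 + math.exp(theta))   # Bernoulli log-partition

def A_star_numeric(mu, lo=-20.0, hi=20.0, steps=40000):
    # brute-force the sup over a fine theta grid
    return max(mu * th - A(th)
               for th in (lo + (hi - lo) * i / steps for i in range(steps + 1)))

mu = 0.3
closed = mu * math.log(mu) + (1 - mu) * math.log(1 - mu)
numeric = A_star_numeric(mu)
print(round(numeric, 6), round(closed, 6))   # they agree

# the maximizer is the stationary point mu = e^theta / (1 + e^theta),
# i.e. theta = log(mu / (1 - mu)):
theta_hat = math.log(mu / (1 - mu))
gap = abs(mu * theta_hat - A(theta_hat) - closed)
```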

SLIDE 27

The 2nd Simplest Graph : two connected Bernoulli nodes x1, x2

  • G(V,E) = 2 connected Bernoulli nodes
  • moments : µ1 = E[x1], µ2 = E[x2], µ12 = E[x1 x2]
  • variational problem, subject to the moment constraints
  • solve (it's still easy!)

SLIDE 28

The 3rd Simplest Graph : x1, x2, x3

  • 3 nodes : 16 constraints
  • # of constraints blows up real fast : 7 nodes → 200,000,000+ constraints
  • hard to keep track of valid µ's (i.e., the full shape and extent of M)
  • no more checking our results against closed-form expressions that we already knew in advance!
  • unless G remains a tree, the entropy A* will not decompose nicely, etc.

SLIDE 29

Variational MF for Ising

  • tractable subgraph H = (V, ∅)
  • fully-factored distribution
  • moment space
  • entropy is additive
  • variational problem for A(θ)
  • using coordinate ascent

SLIDE 30

Variational MF for Ising

  • M_tr is a non-convex inner approximation : M_tr ⊂ M
  • optimizing over M_tr must then yield a lower bound
  • what causes this funky curvature?

SLIDE 31

Factorization with Trees

  • suppose we have a tree G = (V,T)
  • useful factorization for trees : p(x) = ∏_s p(x_s) ∏_(s,t)∈T [ p(x_s, x_t) / ( p(x_s) p(x_t) ) ]
  • entropy becomes H = Σ_s H_s − Σ_(s,t) I_st
  • singleton terms : entropies H_s
  • pairwise terms : Mutual Information I_st
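The tree decomposition of entropy can be confirmed numerically on a 3-node chain x1 - x2 - x3 (the agreement potential below is an arbitrary positive example):

```python
import itertools
import math

# Check H(p) = sum_s H(x_s) - sum_(s,t) I(x_s; x_t) for a chain-structured p.

xs = list(itertools.product([0, 1], repeat=3))

def pot(a, b):                       # pairwise potential favoring agreement
    return 2.0 if a == b else 1.0

w = {x: pot(x[0], x[1]) * pot(x[1], x[2]) for x in xs}
z = sum(w.values())
p = {x: wx / z for x, wx in w.items()}

def H(dist):
    return -sum(v * math.log(v) for v in dist.values() if v > 0)

def marg(idx):                       # marginal of p over the given indices
    out = {}
    for x, px in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + px
    return out

singles = [marg([i]) for i in range(3)]
pairs = {e: marg(list(e)) for e in [(0, 1), (1, 2)]}

def mutual_info(e):
    ps, pt, pst = singles[e[0]], singles[e[1]], pairs[e]
    return sum(v * math.log(v / (ps[(a,)] * pt[(b,)]))
               for (a, b), v in pst.items() if v > 0)

lhs = H(p)
rhs = sum(H(m) for m in singles) - sum(mutual_info(e) for e in pairs)
print(abs(lhs - rhs) < 1e-9)         # entropy decomposes exactly on a tree
```

Replacing the chain with a 3-cycle breaks the identity, which is exactly why the Bethe approximation on the next slide is only an approximation.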

SLIDE 32

Variational MF for Loopy Graphs

  • pretend entropy factorizes like a tree (Bethe approximation)
  • define pseudo-marginals, and impose these normalization and marginalization constraints
  • define the local polytope L(G) obeying these constraints
  • note that for any G : M(G) ⊆ L(G), with equality only for trees : M(G) = L(G)
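A classic concrete instance of the strict inclusion M(G) ⊂ L(G), sketched here in Python (a standard textbook example, not from the slides): on a 3-cycle, "perfect disagreement" pseudo-marginals satisfy every local constraint yet are realized by no joint distribution.

```python
import itertools

# Pseudo-marginals on a 3-cycle: uniform singletons, and pairwise tables
# putting all mass on disagreement. They lie in L(G) but not in M(G).

edges = [(0, 1), (1, 2), (2, 0)]
tau_s = {s: {0: 0.5, 1: 0.5} for s in range(3)}
tau_st = {e: {(a, b): (0.5 if a != b else 0.0)
              for a in (0, 1) for b in (0, 1)} for e in edges}

# local-polytope checks: normalization and edge-to-node marginalization
for e in edges:
    assert abs(sum(tau_st[e].values()) - 1.0) < 1e-12
    for a in (0, 1):
        assert abs(sum(tau_st[e][(a, b)] for b in (0, 1)) - tau_s[e[0]][a]) < 1e-12
        assert abs(sum(tau_st[e][(b, a)] for b in (0, 1)) - tau_s[e[1]][a]) < 1e-12

# any realizing joint must be supported on configurations where every edge
# disagrees -- but an odd cycle admits no such configuration
supports = [x for x in itertools.product([0, 1], repeat=3)
            if all(x[s] != x[t] for s, t in edges)]
print(supports)  # [] : empty support, so these pseudo-marginals are not in M(G)
```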

SLIDE 33

Variational MF for Loopy Graphs

  • L(G) is an outer polyhedral approximation
  • solving this Bethe Variational Problem we get the LBP equations!
  • so fixed points of LBP are the stationary points of the BVP
  • this not only illuminates what was originally an educated "hack" (LBP), but suggests new convergence conditions and improved algorithms (TRW)

SLIDE 34

see the ICML 2008 Tutorial

SLIDE 35

Summary

  • SMF can also be cast in terms of "Free Energy", etc.
  • Tightening the variational bound = minimizing KL divergence
  • Other schemes (e.g., "Variational Bayes") = SMF with additional conditioning (hidden, visible, parameter)
  • Solving the variational problem gives both µ and A(θ)
  • It helps to see problems through the lens of Variational Analysis

SLIDE 36

Matrix of Inference Methods

  • Discrete
  • Chain (online) : BP = forwards (exact); Boyen-Koller (ADF), beam search (deterministic approx.)
  • Low treewidth : VarElim, Jtree, recursive conditioning (exact)
  • High treewidth : Loopy BP, mean field, structured variational, EP, graph-cuts (deterministic approx.); Gibbs (stochastic approx.)
  • Gaussian
  • Chain (online) : BP = Kalman filter (exact)
  • Low treewidth : Jtree = sparse linear algebra (exact)
  • High treewidth : Loopy BP (deterministic approx.); Gibbs (stochastic approx.)
  • Other
  • Chain (online) : EKF, UKF, moment matching (ADF) (deterministic approx.); Particle filter (stochastic approx.)
  • Low treewidth : EP, EM, VB, NBP, Gibbs
  • High treewidth : EP, variational EM, VB, NBP, Gibbs

BP = Belief Propagation, EP = Expectation Propagation, ADF = Assumed Density Filtering, EKF = Extended Kalman Filter, UKF = Unscented Kalman Filter, VarElim = Variable Elimination, Jtree = Junction Tree, EM = Expectation Maximization, VB = Variational Bayes, NBP = Non-parametric BP

by Kevin Murphy

SLIDE 37

THE END