School of Computer Science, Carnegie Mellon University

Approximate Inference: Mean Field Methods

Probabilistic Graphical Models (10-708)

Lecture 17, Nov 12, 2007

Eric Xing

[Figure: a cellular signal-transduction pathway (Receptors A/B, Kinases C/D/E, TF F, Genes G/H) and corresponding graphical models over X1-X8]

Reading: KF-Chap. 12


Questions?

  • Kalman filters
  • Complex models
  • LBP and Bethe free-energy minimization

Approximate Inference

Variational Methods

For a distribution p(X|θ) associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable.

Variational methods formulate probabilistic inference as an optimization problem:

    f^* = \arg\max_{f \in S} F(f)

where, e.g., f is a (tractable) probability distribution and F relates f to the solutions of certain probabilistic queries.


Exponential Family

Exponential representation of graphical models (includes discrete models, Gaussian, Poisson, exponential, and many others):

    p(X \mid \theta) = \exp\Big\{ \sum_\alpha \theta_\alpha \phi_\alpha(X) - A(\theta) \Big\}

The quantity

    E(X) = -\sum_\alpha \theta_\alpha \phi_\alpha(X)

is referred to as the "energy" of the state, so that

    p(X \mid \theta) = \exp\{ -E(X) - A(\theta) \}, \qquad p(X_H, x_E \mid \theta) = \exp\{ -E(X_H, x_E) - A(\theta) \}

For an undirected model with cliques C,

    P(X) = \frac{1}{Z} \prod_{c \in C} \psi_c(X_c)

Example: the Boltzmann distribution on an atomic lattice,

    p(X) = \frac{1}{Z} \exp\Big\{ \sum_{i<j} \theta_{ij} X_i X_j + \sum_i \theta_i X_i \Big\}
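To make the notation concrete, here is a minimal brute-force sketch (toy size and random parameters assumed, not from the lecture) that evaluates this Boltzmann distribution by explicit enumeration; the O(2^n) cost of the partition function Z is exactly what makes such queries intractable at scale.

```python
import itertools
import numpy as np

# Minimal brute-force sketch of the Boltzmann distribution
# p(X) = exp{ sum_{i<j} theta_ij X_i X_j + sum_i theta_i X_i } / Z
# on n spins X_i in {-1,+1}. Enumeration is O(2^n): feasible only for tiny n.

n = 4
rng = np.random.default_rng(0)
theta_pair = np.triu(rng.normal(size=(n, n)), k=1)   # theta_ij, i < j
theta_node = rng.normal(size=n)                      # theta_i

def energy(x):
    # E(x) = -( sum_{i<j} theta_ij x_i x_j + sum_i theta_i x_i )
    x = np.asarray(x)
    return -(x @ theta_pair @ x + theta_node @ x)

states = list(itertools.product([-1, 1], repeat=n))
weights = np.array([np.exp(-energy(x)) for x in states])
Z = weights.sum()                                    # partition function
probs = weights / Z                                  # p(x) for every joint state

# An exact marginal, the kind of query that is intractable for large n
p0 = sum(p for x, p in zip(states, probs) if x[0] == 1)
print(f"log Z = {np.log(Z):.4f}, p(X_0 = +1) = {p0:.4f}")
```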


Lower bounds of exponential functions

First-order bound:

    \exp(x) \ge \exp(\mu)\,(1 + x - \mu)

Third-order bound:

    \exp(x) \ge \exp(\mu)\,\Big(1 + (x-\mu) + \tfrac{1}{2}(x-\mu)^2 + \tfrac{1}{6}(x-\mu)^3\Big)

(The odd-order Taylor truncations of exp around μ are global lower bounds.)
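A quick numerical check of the two bounds above (a sketch; the grid and the choice of μ are arbitrary):

```python
import numpy as np

# Verify the first- and third-order lower bounds of exp(x) on a grid.
# Both are odd-order Taylor truncations of exp around mu, which are
# global lower bounds of exp.
mu = 0.5
x = np.linspace(-5.0, 5.0, 2001)
t = x - mu
first = np.exp(mu) * (1 + t)
third = np.exp(mu) * (1 + t + t**2 / 2 + t**3 / 6)
assert np.all(np.exp(x) >= first - 1e-12)
assert np.all(np.exp(x) >= third - 1e-12)
print("both lower bounds hold on the grid")
```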


Lower bounding the likelihood

Lemma: every marginal distribution q(X_H) defines a lower bound of the likelihood, where x_E denotes the observed variables (evidence). Representing q(X_H) by \exp\{-E'(X_H)\} and applying the first-order bound with \mu = -E'(x_H):

    p(x_E) \ge \int d x_H \, \exp\{-E'(x_H)\}\,\big(1 - E(x_H, x_E) - A(\theta) + E'(x_H)\big)

Upgradeable to a higher-order bound [Leisink and Kappen, 2000].


Taking logarithms gives the same lemma in terms of expectations: for any q(X_H),

    \log p(x_E \mid \theta) \ge C - \langle E(X_H, x_E) \rangle_q + H_q

where C is a constant,

    \langle E \rangle_q : expected energy
    H_q : entropy
    \langle E \rangle_q - H_q : Gibbs free energy

KL and variational (Gibbs) free energy

Kullback-Leibler divergence:

    KL(q \,\|\, p) = \sum_z q(z) \ln \frac{q(z)}{p(z)}

"Boltzmann's law" (definition of "energy"):

    p(z) = \frac{1}{C} \exp\{ -E(z) \}

Substituting,

    KL(q \,\|\, p) = \sum_z q(z) E(z) + \sum_z q(z) \ln q(z) + \ln C

The first two terms are the Gibbs free energy G(q); KL (and hence G) is minimized when q(Z) = p(Z).


KL and Log Likelihood

Jensen's inequality gives a lower bound of the likelihood:

    \ell(\theta; x) = \log p(x \mid \theta)
                    = \log \sum_z p(x, z \mid \theta)
                    = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
                    \ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}

    \Rightarrow \quad \ell(\theta; x) \ge \langle \ell_c(\theta; x, z) \rangle_q + H_q = L(q)

Equivalently, an exact decomposition in terms of KL:

    \ell(\theta; x) = L(q) + KL\big(q \,\|\, p(z \mid x, \theta)\big)

Setting q(·) = p(z | x, θ) closes the gap (c.f. EM).

[Figure: decomposition of ln p(D) into L(q) and KL(q || p)]
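The decomposition can be checked numerically on a toy discrete model (a sketch with an arbitrary random joint p(z, x)):

```python
import numpy as np

# Sketch: verify log p(x) = L(q) + KL(q || p(z|x)) on a toy discrete model,
# so L(q) <= log p(x) with equality iff q(z) = p(z|x)  (c.f. EM).
rng = np.random.default_rng(2)
joint = rng.dirichlet(np.ones(8)).reshape(4, 2)   # p(z, x): z in {0..3}, x in {0,1}
x = 0
p_x = joint[:, x].sum()                           # p(x)
post = joint[:, x] / p_x                          # p(z|x)

q = rng.dirichlet(np.ones(4))                     # an arbitrary q(z)
elbo = np.sum(q * np.log(joint[:, x] / q))        # L(q) = <log p(x,z) - log q>_q
kl = np.sum(q * np.log(q / post))
assert np.isclose(np.log(p_x), elbo + kl)
assert elbo <= np.log(p_x)
print(f"log p(x) = {np.log(p_x):.4f}, L(q) = {elbo:.4f}, KL = {kl:.4f}")
```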


A variational representation of probability distributions

    q^* = \arg\max_{q \in Q} \big\{ -\langle E \rangle_q + H_q \big\} = \arg\min_{q \in Q} \big\{ \langle E \rangle_q - H_q \big\}

where Q is the set of realizable distributions, e.g., all valid parameterizations of exponential-family distributions, or marginal polytopes [Wainwright et al., 2003].

Difficulty: H_q is intractable for a general q. "Solution": approximate H_q and/or relax or tighten Q.


Bethe Free Energy / LBP

Relax the optimization problem:

  • We do not optimize q(X) explicitly; instead we optimize over a set of beliefs (pseudo-marginals), b = \tau = \{ b_{i,j}(x_i, x_j), b_i(x_i) \}
  • approximate objective: H_q \approx H_{Bethe}(b_i, b_{i,j}), giving an approximate free energy F(b)
  • relaxed feasible set: \mathbb{M} \to \mathbb{M}_o \supseteq \mathbb{M}, the locally consistent beliefs

    \mathbb{M}_o = \Big\{ \tau \ge 0 \;\Big|\; \sum_{x_i} \tau_i(x_i) = 1, \;\; \sum_{x_j} \tau_{i,j}(x_i, x_j) = \tau_i(x_i) \Big\}

The loopy BP algorithm is a fixed-point iteration procedure that tries to solve (a generic sketch follows):

    b^* = \arg\min_{b \in \mathbb{M}_o} F(b)
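A minimal sketch of loopy BP on a small pairwise binary MRF (toy graph and random potentials assumed; this is the generic sum-product fixed-point iteration, not code from the lecture):

```python
import numpy as np

# Minimal loopy BP sketch for a pairwise binary MRF on an edge list.
# Messages m_{i->j}(x_j) are iterated to a fixed point; the beliefs b_i
# are the tau's that (approximately) minimize the Bethe free energy.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a single loop
rng = np.random.default_rng(3)
node_pot = rng.uniform(0.5, 1.5, size=(n, 2))     # psi_i(x_i)
edge_pot = {e: rng.uniform(0.5, 1.5, size=(2, 2)) for e in edges}

nbrs = {i: [] for i in range(n)}
for (i, j) in edges:
    nbrs[i].append(j); nbrs[j].append(i)
msg = {(i, j): np.ones(2) / 2
       for (i, j) in list(edges) + [(j, i) for i, j in edges]}

def psi(i, j):
    # pairwise potential indexed as [x_i, x_j] for the direction i -> j
    return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

for _ in range(50):                               # fixed-point iteration
    new = {}
    for (i, j) in msg:
        # m_{i->j}(x_j) = sum_{x_i} psi_i psi_ij prod_{k in N(i)\j} m_{k->i}
        prod = node_pot[i].copy()
        for k in nbrs[i]:
            if k != j:
                prod *= msg[(k, i)]
        m = prod @ psi(i, j)
        new[(i, j)] = m / m.sum()
    msg = new

beliefs = np.array([node_pot[i] * np.prod([msg[(k, i)] for k in nbrs[i]], axis=0)
                    for i in range(n)])
beliefs /= beliefs.sum(axis=1, keepdims=True)
print("approximate marginals b_i(x_i):\n", beliefs)
```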

Mean field methods: tightening the optimization space

  • Optimize q(X_H) in a space of tractable families, i.e., distributions over subgraphs of G_p for which exact computation of H_q is feasible
  • exact objective: H_q
  • tightened feasible set: Q \to T \subseteq Q

    q^* = \arg\min_{q \in T} \; \langle E \rangle_q - H_q


Mean Field Approximation

Cluster-based approximation to the Gibbs free energy (Wiegerinck 2001; Xing et al. 2003, 2004):

  • Exact: G[p(X)] (intractable)
  • Clusters: G[\{q_c(X_c)\}]


Mean field approximation to the Gibbs free energy

Given a disjoint clustering, {C_1, ..., C_I}, of all variables, let

    q(X) = \prod_i q_i(X_{C_i})

Mean-field free energy:

    G_{MF} = \big\langle E(X) \big\rangle_{\prod_i q_i} + \sum_i \sum_{x_{C_i}} q_i(x_{C_i}) \ln q_i(x_{C_i})

e.g., with singleton clusters and pairwise potentials (naive mean field):

    G_{MF} = \sum_{i<j} \sum_{x_i, x_j} q_i(x_i)\, q_j(x_j)\, \phi(x_i, x_j) + \sum_i \sum_{x_i} q_i(x_i)\, \phi(x_i) + \sum_i \sum_{x_i} q_i(x_i) \ln q_i(x_i)

  • G_{MF} will never equal the exact Gibbs free energy, no matter what clustering is used, but it always defines a lower bound of the likelihood (checked numerically in the sketch below)
  • Optimize each q_i(x_{C_i}) by variational calculus, and do inference within each q_i using any tractable algorithm
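A sketch checking the lower-bound property on the small Boltzmann model from before (random parameters assumed): for any fully factorized q, -G_MF <= ln Z, with ln Z computed by brute force for comparison.

```python
import itertools
import numpy as np

# Sketch: naive mean-field free energy for the Boltzmann model with spins
# in {-1,+1}, parameterized by mu_i = q_i(X_i = +1), so <X_i> = 2 mu_i - 1.
# For any factorized q, -G_MF is a lower bound of ln Z.
n = 4
rng = np.random.default_rng(4)
theta_pair = np.triu(rng.normal(size=(n, n)), k=1)
theta_node = rng.normal(size=n)

def G_MF(mu):
    m = 2 * mu - 1                                   # means <X_i>
    expected_energy = -(m @ theta_pair @ m + theta_node @ m)
    neg_entropy = np.sum(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))
    return expected_energy + neg_entropy             # <E>_q - H_q

# exact ln Z by enumeration, for comparison only
states = np.array(list(itertools.product([-1, 1], repeat=n)))
logZ = np.log(np.exp(states @ theta_node +
                     np.einsum('si,ij,sj->s', states, theta_pair, states)).sum())

mu = rng.uniform(0.1, 0.9, size=n)                   # an arbitrary factorized q
assert -G_MF(mu) <= logZ + 1e-9                      # the lower bound holds
print(f"-G_MF = {-G_MF(mu):.4f} <= ln Z = {logZ:.4f}")
```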


The Generalized Mean Field theorem

Theorem: The optimum GMF approximation to the cluster marginal is isomorphic to the cluster posterior of the original distribution given internal evidence and its generalized mean fields:

    q_i^*(X_{C_i, H}) = p\big( X_{C_i, H} \mid x_{C_i, E}, \langle X_{MB_i} \rangle_{q_{j \ne i}} \big)

GMF algorithm: iterate over each q_i.


A generalized mean field algorithm [Xing et al., UAI 2003]


Convergence theorem

Theorem: The GMF algorithm is guaranteed to converge to a local optimum, and provides a lower bound for the likelihood of evidence (or the partition function) of the model.


The naive mean field approximation

Approximate p(X) by a fully factorized q(X) = \prod_i q_i(X_i). For the Boltzmann distribution p(X) = \exp\{\sum_{i<j} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i\}/Z:

Gibbs predictive distribution:

    p(X_i \mid \{x_j : j \in N_i\}) = \exp\Big\{ \theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i x_j - A_i \Big\}

Mean field equation:

    q_i(X_i) = \exp\Big\{ \theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i \langle X_j \rangle_{q_j} - A_i \Big\} = p\big( X_i \mid \{\langle X_j \rangle_{q_j} : j \in N_i\} \big)

  • \langle X_j \rangle_{q_j} resembles a "message" sent from node j to i
  • \{\langle X_j \rangle_{q_j} : j \in N_i\} forms the "mean field" applied to X_i from its neighborhood
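For X_i in {-1, +1}, taking expectations of the mean field equation gives the classical fixed point <X_i> = tanh(theta_i0 + sum_j theta_ij <X_j>). A minimal sketch of the resulting coordinate updates (random couplings assumed):

```python
import numpy as np

# Sketch: naive mean-field updates for the Boltzmann machine with
# X_i in {-1,+1}. The mean-field equation
#   q_i(X_i) ∝ exp{ X_i (theta_i0 + sum_j theta_ij <X_j>_{q_j}) }
# reduces to a tanh fixed point for the means.
n = 4
rng = np.random.default_rng(5)
theta = np.triu(rng.normal(size=(n, n)), k=1)
theta = theta + theta.T                      # symmetric couplings, zero diagonal
theta0 = rng.normal(size=n)

m = np.zeros(n) + 1e-3                       # initial means <X_i>
for sweep in range(200):
    for i in range(n):                       # iterate the mean-field equations
        field = theta0[i] + theta[i] @ m     # the "mean field" from N(i)
        m[i] = np.tanh(field)                # updated <X_i>_{q_i}
print("mean-field means <X_i>:", m)
```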


Generalized MF approximation to Ising models

Cluster marginal of a square block C_k:

    q(X_{C_k}) \propto \exp\Big\{ \sum_{i \in C_k} \theta_{i0} X_i + \sum_{i,j \in C_k} \theta_{ij} X_i X_j + \sum_{i \in C_k,\, j \in MB(C_k)} \theta_{ij} X_i \langle X_j \rangle_{q(C_{k'})} \Big\}

Virtually a reparameterized Ising model of small size (see the sketch below).
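A sketch of GMF with 2x2 blocks on a toy 4x4 Ising grid (uniform coupling strength and the update schedule are arbitrary choices): each block marginal is computed exactly by enumerating its four spins, with boundary terms replaced by the neighboring blocks' current means.

```python
import itertools
import numpy as np

# Sketch: GMF with 2x2 blocks on a 4x4 Ising grid. Within-block couplings
# are kept exactly; couplings across block boundaries use the neighbors'
# current means <X_j>, i.e. a small reparameterized Ising model per block.
L = 4
rng = np.random.default_rng(6)
theta = 0.5                                   # uniform coupling strength
m = rng.uniform(-0.1, 0.1, size=(L, L))       # current means <X_ij>
blocks = [(r, c) for r in (0, 2) for c in (0, 2)]
configs = np.array(list(itertools.product([-1, 1], repeat=4)))

def neighbors(i, j):
    return [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= i + di < L and 0 <= j + dj < L]

for sweep in range(20):
    for (r, c) in blocks:
        cells = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
        logw = np.zeros(len(configs))
        for a, (i, j) in enumerate(cells):
            for (ni, nj) in neighbors(i, j):
                if (ni, nj) in cells:         # exact within-block coupling
                    b = cells.index((ni, nj))
                    logw += 0.5 * theta * configs[:, a] * configs[:, b]
                else:                         # mean-field boundary term
                    logw += theta * configs[:, a] * m[ni, nj]
        q = np.exp(logw - logw.max()); q /= q.sum()   # exact block marginal
        for a, (i, j) in enumerate(cells):            # refresh block means
            m[i, j] = q @ configs[:, a]
print("block mean-field means:\n", m.round(3))
```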


GMF approximation to Ising models

[Figure: comparison of GMF with 2x2 blocks, GMF with 4x4 blocks, and BP]

Attractive coupling: positively weighted. Repulsive coupling: negatively weighted.


Automatic Variational Inference

Currently, for each new model, we have to:

  • derive the variational update equations
  • write application-specific code to find the solution

Each can be time-consuming and error-prone. Can we build a general-purpose inference engine which automates these procedures?

[Figure: fHMM => mean field approximation => structured variational approximation]


Cluster-based MF (e.g., GMF)

A general, iterative message-passing algorithm; the clustering completely defines the approximation:

  • preserves dependencies
  • flexible performance/cost trade-off
  • clustering automatable

Recovers model-specific structured VI algorithms, including:

  • fHMM, LDA
  • variational Bayesian learning algorithms

Easily provides new structured VI approximations to complex models.


Example 1: Bayesian Gaussian Model

Infer the mean µ and the precision τ (inverse variance) of a Gaussian from data: likelihood function, conjugate priors on µ and τ, and a factorized variational distribution q(µ, τ) = q(µ) q(τ).

Variational Posterior Distribution

[Equations for the variational posterior factors q(µ) and q(τ) and their parameters; see the sketch below]
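A sketch of the resulting coordinate updates, following the standard conjugate Gaussian-Gamma treatment (c.f. Bishop, PRML §10.1.3); the hyperparameters and data below are arbitrary illustrations, since the slide's own equations did not survive extraction.

```python
import numpy as np

# Sketch of VB for the Bayesian Gaussian model:
#   x_n ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0),
# with factorized q(mu, tau) = q(mu) q(tau) and coordinate updates.
rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=0.5, size=100)
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0           # arbitrary hyperparameters

E_tau = a0 / b0                                  # initial guess for <tau>
for it in range(50):
    # q(mu) = N(mu_N, 1/lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) = Gamma(a_N, b_N), using <mu> = mu_N, var(mu) = 1/lam_N
    a_N = a0 + (N + 1) / 2
    E_sq = np.sum((x - mu_N) ** 2) + N / lam_N   # <sum_n (x_n - mu)^2>_q
    b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N
print(f"q(mu): mean {mu_N:.3f}, var {1/lam_N:.5f};  <tau> = {E_tau:.3f}")
```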


Initial Configuration

[Figure (a): contours over (µ, τ) of the true posterior and the initial factorized variational approximation]

After Updating q(µ)

[Figure (b)]

After Updating q(τ)

[Figure (c)]

Converged Solution

[Figure (d)]


Example 2: Latent Dirichlet Allocation

Blei, Ng and Jordan (2003). A generative model of documents (but broadly applicable, e.g., to collaborative filtering, image retrieval, bioinformatics).

Generative model:

  • choose θ ~ Dir(α)
  • choose topic z_n ~ Multi(θ)
  • choose word w_n ~ Multi(β_{z_n})

[Figure: LDA plate diagram over z and w, with N words per document and M documents]

Latent Dirichlet Allocation

Variational approximation:

    q(\theta, z) = q(\theta)\, q(z) = \mathrm{Dir}(\theta \mid \gamma) \times \prod_n \mathrm{Multi}(z_n \mid \phi_n)

with γ a function of α and the φ_n, and each φ_n a function of β, w_n, and ⟨ln θ⟩ (see the sketch below).

Data set:

  • 15,000 documents
  • 90,000 terms
  • 2.1 million words

Model:

  • 100 factors
  • 9 million parameters

MCMC could be totally infeasible for this problem.


Example 3: Sigmoid belief network

[Figure: comparison of GMFr, GMFb, and BP]

Example 4: Factorial HMM