SLIDE 1

Variational Methods for Inference

based on a paper by Michael Jordan et al.

Patrick Pletscher

ETH Zurich, Switzerland

16th May 2006

SLIDE 2

The Need for Approximate Methods – FHMM

[Figure: Factorial HMM with three hidden chains X(1), X(2), X(3), each with states X(m)_1, X(m)_2, X(m)_3, and shared observations Y1, Y2, Y3.]

Inference

P(H|E) = P(H, E) / P(E), with complexity O(N^(M+1) T) for exact inference (N states per chain, M chains, T time steps)
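To make the cost concrete, here is a minimal sketch (a toy model of my own, not from the slides) of exact FHMM inference: the M chains, each with N states, are collapsed into a single chain over N^M combined states and the forward algorithm is run on it. The naive collapsed version below costs O(T·N^(2M)); the O(N^(M+1) T) figure on the slide comes from additionally exploiting the chain structure, but the exponential dependence on M is the same.

```python
# Toy FHMM (all numbers/names are illustrative): exact inference by collapsing
# M chains with N states each into one chain over N**M combined states.
import itertools
import numpy as np

N, M, T = 2, 3, 3                       # states per chain, chains, time steps
rng = np.random.default_rng(0)

# Per-chain transition matrices A[m][i, j] = P(x_t = j | x_{t-1} = i).
A = [rng.dirichlet(np.ones(N), size=N) for _ in range(M)]

# Combined state space: one "mega" state per tuple (x_t^(1), ..., x_t^(M)).
states = list(itertools.product(range(N), repeat=M))
K = len(states)                         # K = N**M grows exponentially in M

# Chains evolve independently, so the combined transition factorizes.
A_comb = np.ones((K, K))
for a, s in enumerate(states):
    for b, u in enumerate(states):
        for m in range(M):
            A_comb[a, b] *= A[m][s[m], u[m]]

# Dummy emission likelihoods P(y_t | combined state): it is the shared
# observation Y_t that couples the chains a posteriori.
emis = rng.random((T, K))

# Forward algorithm on the collapsed chain: O(T * K**2) = O(T * N**(2M)).
alpha = np.full(K, 1.0 / K) * emis[0]   # uniform initial state distribution
for t in range(1, T):
    alpha = (alpha @ A_comb) * emis[t]
print("P(E) =", alpha.sum(), "using", K, "combined states")
```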

SLIDE 5

Overview

1. Motivation
2. Variational Methods
3. Discussion

SLIDE 6

Toy Example: ln(x)

Idea of Variational Methods

Characterize a probability distribution as the solution of an optimization problem.

Intro: ln(x) variationally

Although ln(x) is not a probability, it is still a useful example. Note that ln(x) is a concave function:

ln(x) = min_λ {λx − ln λ − 1}

ln(x) is now expressed through linear functions of x! The price: the minimization has to be carried out for each x.

Upper bounds

For any given x, we have: ln(x) ≤ λx − ln λ − 1, for all λ.
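As a quick numerical sanity check (my own sketch, not part of the talk), one can evaluate this family of linear bounds on a grid of λ values and confirm that the tightest bound recovers ln(x):

```python
# Check ln(x) = min over lambda of {lambda*x - ln(lambda) - 1} on a grid.
import numpy as np

lambdas = np.linspace(1e-3, 10, 100_000)        # grid over the variational parameter
for x in [0.5, 1.0, 2.0, 3.0]:
    bounds = lambdas * x - np.log(lambdas) - 1  # one linear upper bound per lambda
    print(f"x={x}: ln(x)={np.log(x):.4f}, tightest bound={bounds.min():.4f}")
```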

SLIDE 7

Toy Example: ln(x)

[Figure: plot of ln(x) for x in (0, 3], together with a linear upper bound λx − ln λ − 1.]

SLIDE 9

Toy Example: ln(x)

At x = 1: setting d/dλ {λ·1 − ln λ − 1} = 0, it follows that λ = 1.
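The same optimum can be found numerically; this small check (mine, not from the slides) confirms that the grid minimizer at x = 1 sits at λ ≈ 1:

```python
# The bound lambda*x - ln(lambda) - 1 is minimized at lambda = 1/x,
# so at x = 1 the best lambda is 1, matching the derivation above.
import numpy as np

x = 1.0
lambdas = np.linspace(1e-3, 5, 500_000)
best = lambdas[np.argmin(lambdas * x - np.log(lambdas) - 1)]
print(best)  # approximately 1.0
```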

SLIDE 16

Convex Duality (1/2)

1. Transform the function so that it becomes convex or concave. The transformation has to be invertible.
2. Calculate the conjugate function (for a concave function f(x)):

   f(x) = min_λ {λᵀx − f∗(λ)}, where f∗(λ) = min_x {λᵀx − f(x)}

3. Transform back.
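In the scalar case the conjugate is easy to approximate on a grid; the sketch below (my own, using ln as the running example) computes f∗(λ) = min_x {λx − f(x)} numerically and compares it with the closed form 1 + ln λ derived on a later slide:

```python
# Numerical conjugate f*(lambda) = min_x { lambda*x - f(x) } for a concave f,
# checked against the closed form 1 + ln(lambda) for f = ln.
import numpy as np

def conjugate(f, lam, xs):
    return np.min(lam * xs - f(xs))  # grid approximation of the minimization

xs = np.linspace(1e-4, 50, 200_000)  # grid over x (assumes the minimum lies inside)
for lam in [0.5, 1.0, 2.0]:
    print(lam, conjugate(np.log, lam, xs), 1 + np.log(lam))
```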

SLIDE 17

Convex Duality (2/2)

[Figure: plot of a concave function f(x) together with the line λx.]

SLIDE 18

Convex Duality and ln(x) Example

Minimize: setting d/dx {λx − ln(x)} = 0, we get λ − 1/x = 0, hence x = 1/λ.

Finally resubstitute: f∗(λ) = λ·(1/λ) − ln(1/λ) = 1 + ln λ.

This is the "magical" intercept of the ln example: f(x) = min_λ {λx − ln λ − 1}.

SLIDE 19

Approximations using Convex Duality (1/2)

Basic idea

Simplify the joint probability distribution by transforming the local probability functions, usually only for "hard" nodes. Afterwards one can use exact methods.

This might look as follows:

[Figure: Replacing a difficult graphical model by a simpler one, here for Latent Dirichlet Allocation: the model over α, θ, z, w, β (plates M, N) is replaced by a simpler variational model over γ, θ, φ, z (plates M, N).]

SLIDE 20

Approximations using Convex Duality (2/2)

Joint Distribution

Product of upper bounds is an upper bound:

P(S) = ∏_i P(S_i | S_π(i)) ≤ ∏_i P^U(S_i | S_π(i), λ_i^U)

Marginalization

Upper bound for P(E), the likelihood:

P(E) = Σ_{H} P(H, E) ≤ Σ_{H} ∏_i P^U(S_i | S_π(i), λ_i^U)
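Both claims are easy to verify on a toy model. The sketch below is entirely my own construction (a small chain-structured network with random local factors and arbitrarily inflated "upper bound" factors); it checks that pointwise upper bounds on every local conditional yield an upper bound on the summed-out likelihood:

```python
# If PU(s_i | parent) >= P(s_i | parent) pointwise, then the product bound
# holds per configuration, and summing over hidden states preserves it.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_nodes, n_states = 3, 2
P = rng.random((n_nodes, n_states, n_states))   # local factors P(S_i | S_parent)
PU = P * (1 + rng.random(P.shape))              # pointwise upper bounds (factor >= 1)

def joint(factors, config):
    # Toy chain-structured net: node i conditions on node i-1 (node 0 on a dummy 0).
    p, prev = 1.0, 0
    for i, s in enumerate(config):
        p *= factors[i, prev, s]
        prev = s
    return p

configs = list(itertools.product(range(n_states), repeat=n_nodes))
exact = sum(joint(P, c) for c in configs)
bound = sum(joint(PU, c) for c in configs)
print(exact, "<=", bound)                       # the bound always holds
```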

SLIDE 21

Sequential Approach

An unsupervised approach…

The algorithm transforms nodes as needed. Backward "elimination" is popular, as the graph remains tractable.

[Figure: forward and backward transformation orders.]

Discussion

  • Flexible, out-of-the-box application,
  • but: no “insider” knowledge is used.
SLIDE 22

Block Approach

A supervised approach…

Designate in advance which nodes are to be transformed.

[Figure: the LDA model over α, θ, z, w, β and its simpler variational counterpart over γ, θ, φ, z, as on the earlier slide.]

Minimize Kullback-Leibler Divergence

λ∗ = arg min_λ D(Q(H|E, λ) ‖ P(H|E)),

where D(Q ‖ P) := Σ_{S} Q(S) ln [Q(S) / P(S)]
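As a minimal illustration (a toy target distribution and a fully factorized Q; all names are mine, not from the paper), the variational parameters can be fitted by minimizing D(Q ‖ P) directly for a small discrete model:

```python
# Fit a factorized Q(s1)Q(s2) to a joint P over two binary variables by
# minimizing D(Q || P) = sum_S Q(S) ln [Q(S) / P(S)].
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
P = rng.random((2, 2))
P /= P.sum()                                    # toy target joint distribution

def kl(params):
    a, b = 1 / (1 + np.exp(-params))            # logits -> marginals of Q
    Q = np.outer([1 - a, a], [1 - b, b])        # fully factorized Q
    return float(np.sum(Q * np.log(Q / P)))

res = minimize(kl, x0=np.zeros(2))              # optimize the variational parameters
print("best factorized KL:", res.fun)
```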

SLIDE 23

FHMM Variationally

[Figure: the FHMM with hidden chains X(1), X(2), X(3) over three time steps and observations Y1, Y2, Y3, to be treated variationally.]

SLIDE 25

Discussion: some pointers

Quite broad questions…

  • Does anybody know more about this new dependence, introduced by the optimization step?

  • Any theoretical guarantees?
  • Anybody already used variational methods? If so, for what? Experiences?

Junction Tree algorithm…

  • Translation from conditional probabilities to clique potentials?
  • How do clique potentials change when we introduce the chords?