SLIDE 1
Variational Methods for Inference
based on a paper by Michael Jordan et al.
Patrick Pletscher
ETH Zurich, Switzerland
16th May 2006
SLIDE 2 The Need for Approximate Methods – FHMM
Figure: Factorial HMM with three hidden chains X^{(1)}_t, X^{(2)}_t, X^{(3)}_t (t = 1, 2, 3), all emitting the observations Y_1, Y_2, Y_3.
Inference
P(H|E) = P(H, E) / P(E), complexity O(N^{M+1} T)
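A sketch of where this complexity comes from (the standard junction-tree argument; not spelled out on the slide): after moralization and triangulation, the cliques couple M + 1 of the N-state hidden variables at once, so each clique table has N^{M+1} entries, and there are O(T) such cliques:
\[
\underbrace{N \cdot N \cdots N}_{M+1\ \text{variables}} = N^{M+1}\ \text{entries per clique}
\quad\Longrightarrow\quad O\!\left(N^{M+1} T\right).
\]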
SLIDE 5
Overview
1. Motivation
2. Variational Methods
3. Discussion
SLIDE 6 Toy Example: ln(x)
Idea of Variational Methods
Characterize a probability distribution as the solution of an optimization problem.
Intro: ln(x) variationally
Although ln(x) is no probability, it is still a useful example. Note that ln(x) is a concave function:
ln(x) = min_λ {λx − ln λ − 1}
ln(x) is now expressed via linear functions in x! The price: the minimization over λ has to be carried out.
Upper bounds
For any given x, we have: ln(x) ≤ λx − ln λ − 1, for all λ.
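A minimal numerical sketch of this bound (my illustration, assuming NumPy; the function name is not from the paper): for any fixed λ the right-hand side upper-bounds ln(x), and minimizing over λ recovers ln(x) at λ = 1/x.

```python
import numpy as np

def bound(x, lam):
    # Variational upper bound on ln(x): linear in x for fixed lam.
    return lam * x - np.log(lam) - 1.0

x = 2.0
lams = np.linspace(0.01, 10.0, 100000)       # crude grid over lambda
values = bound(x, lams)

print(np.log(x))                             # 0.6931...
print(values.min())                          # ~0.6931, attained near lam = 1/x = 0.5
print(np.all(values >= np.log(x) - 1e-12))   # True: every lam gives an upper bound
```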
SLIDE 7
Toy Example: ln(x)
Figure: ln(x) together with a linear upper bound λx − ln λ − 1 for one value of λ.
SLIDE 9
Toy Example: ln(x)
For x = 1: d/dλ {λ · 1 − ln λ − 1} = 0, from which it follows that λ = 1.
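The same calculation for a general x (a one-line derivation, implicit in the slides):
\[
\frac{d}{d\lambda}\left\{\lambda x - \ln\lambda - 1\right\} = x - \frac{1}{\lambda} \overset{!}{=} 0
\quad\Longrightarrow\quad \lambda^{*} = \frac{1}{x},
\]
and resubstituting gives λ* x − ln λ* − 1 = 1 + ln x − 1 = ln x: the minimum over λ recovers ln(x) exactly.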
SLIDE 10
Toy Example: ln(x)
Figure: with λ = 1 the bound λx − ln λ − 1 = x − 1 is the tangent to ln(x) at x = 1.
SLIDE 16
Convex Duality (1/2)
1. Transform the function such that it becomes convex or concave. The transformation has to be invertible.
2. Calculate the conjugate function (here for a concave function f(x)):
f(x) = min_λ {λ^T x − f*(λ)},
where f*(λ) = min_x {λ^T x − f(x)}.
3. Transform back.
SLIDE 17
Convex Duality (2/2)
Figure: a concave function f(x) together with the line λx; the conjugate f*(λ) = min_x {λ^T x − f(x)} is the smallest vertical gap between the line and f.
SLIDE 18
Convex Duality and ln(x) Example
Minimize: setting d/dx {λx − ln(x)} = 0 gives λ − 1/x = 0, hence x = 1/λ.
Finally, resubstitute:
f*(λ) = λ · (1/λ) + ln λ = 1 + ln λ,
which is exactly the “magical” intercept of the ln example:
f(x) = min_λ {λx − ln λ − 1}
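A small numerical check of this conjugate pair (a sketch assuming NumPy; all names are mine): compute f*(λ) = min_x {λx − ln x} on a grid, then recover f(x) = min_λ {λx − f*(λ)}.

```python
import numpy as np

xs = np.linspace(1e-3, 50.0, 200000)  # grid over x > 0

def f_star(lam):
    # Conjugate of the concave f(x) = ln(x): f*(lam) = min_x { lam*x - ln(x) }.
    return np.min(lam * xs - np.log(xs))

x = 2.0
# Recover f(x) = min_lam { lam*x - f*(lam) } on a grid over lambda.
recovered = min(lam * x - f_star(lam) for lam in np.linspace(0.05, 5.0, 200))
print(np.log(x), recovered)  # both ~0.6931; analytically f*(lam) = 1 + ln(lam)
```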
SLIDE 19 Approximations using Convex Duality (1/2)
Basic idea
Simplify the joint probability distribution by transforming the local probability functions, usually only for the “hard” nodes. Afterwards one can use exact methods.
This might look like this . . .
Figure: Replacing a difficult graphical model by a simpler one, here for Latent Dirichlet Allocation: the original model over (α, θ, z, w, β) is replaced by a simpler variational model with parameters (γ, φ) over (θ, z).
SLIDE 20 Approximations using Convex Duality (2/2)
Joint Distribution
Product of upper bounds is an upper bound (all factors are nonnegative, so the factor-wise inequalities multiply):
P(S) = ∏_i P(S_i | S_{π(i)}) ≤ ∏_i P^U(S_i | S_{π(i)}, λ^U_i)
Marginalization
Upper bound for P(E), the likelihood:
P(E) = Σ_H P(H, E) ≤ Σ_H ∏_i P^U(S_i | S_{π(i)}, λ^U_i)
SLIDE 21 Sequential Approach
An unsupervised approach. . .
The algorithm transforms nodes for as long as needed. Backward “elimination” is popular, as the graph remains tractable throughout.
Figure: forward and backward variants of the sequential transformation.
Discussion
- Flexible, out-of-the-box application.
- But: no “insider” knowledge about the model is used.
SLIDE 22 Block Approach
A supervised approach. . .
Designate in advance which nodes are to be transformed.
Figure: LDA and its simpler variational counterpart (the same models as on slide 19).
Minimize Kullback-Leibler Divergence
λ* = arg min_λ D(Q(H|E, λ) ∥ P(H|E)),
where D(Q ∥ P) := Σ_S Q(S) ln [Q(S) / P(S)]
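A toy sketch of this objective (my illustration, not from the talk; assumes NumPy): fit a fully factorized Q to a small joint P over two binary variables by grid-searching the KL divergence.

```python
import numpy as np

# Toy joint P(x1, x2) over two binary variables (rows: x1, columns: x2).
P = np.array([[0.30, 0.10],
              [0.15, 0.45]])

def kl(Q, P):
    # D(Q || P) = sum_S Q(S) * ln(Q(S) / P(S)), for strictly positive tables.
    return np.sum(Q * np.log(Q / P))

# Factorized family Q(x1, x2) = q1(x1) * q2(x2); the variational
# parameters are a = q1(x1 = 0) and b = q2(x2 = 0).
grid = np.linspace(0.01, 0.99, 99)
best = min((kl(np.outer([a, 1 - a], [b, 1 - b]), P), a, b)
           for a in grid for b in grid)
print("min KL = %.4f at q1(0) = %.2f, q2(0) = %.2f" % best)
```

In the block approach proper the minimization is carried out analytically or by coordinate updates rather than grid search; the grid only keeps the sketch short.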
SLIDE 23
FHMM Variationally
Figure: the factorial HMM from slide 2: three hidden chains X^{(1)}_t, X^{(2)}_t, X^{(3)}_t (t = 1, 2, 3) and observations Y_1, Y_2, Y_3.
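The slide only shows the graph, so the following is my reading rather than the talk's: in Ghahramani and Jordan's treatment of the FHMM, the block approach picks a Q that decouples the chains,
\[
Q(H \mid E, \lambda) = \prod_{m=1}^{M} Q_m\!\left(X^{(m)}_{1:T} \mid \lambda^{(m)}\right),
\]
so each chain can be handled exactly by forward-backward, while the coupling through the shared observations Y_t is absorbed into the variational parameters λ^{(m)}.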
SLIDE 25 Discussion: some pointers
Quite broad questions . . .
- Does anybody know more about the new dependence introduced by the optimization step?
- Any theoretical guarantees?
- Has anybody already used variational methods? If so, for what? Any experiences?
Junction Tree algorithm . . .
- How is the translation from conditional probabilities to clique potentials done?
- How do the clique potentials change when we introduce the chords?