Discrete Markov Random Fields: The Inference Story
Pradeep Ravikumar
Graphical Models, The History
How to model stochastic processes of the world? I want to model the world, and I like graphs...
2
History
Mid to late twentieth century: pioneering work of conspiracy theorists. The System, it is all connected...
3
History
Late Twentieth Century: people realize that existing scientific literature offers a marriage between probability theory and graph theory – which can be used to model the world.
4
History
Common misconception: called graphical models after Grafici Modeles, a sculptor protégé of Da Vinci. In fact, they are called graphical models because they model stochastic systems using graphs.
5
Graphical Models
(Figure: an undirected graph on nodes X1, X2, X3, X4.)
7
Graphical Models
(Figure: an undirected graph on nodes X1, X2, X3, X4.)
Separating Set ∼ (X2, X3) disconnects X1 and X4
9
Graphical Models
(Figure: an undirected graph on nodes X1, X2, X3, X4.)
Separating Set ∼ (X2, X3) disconnects X1 and X4 Global Markov Property ∼ X1 ⊥ X4 | (X2, X3)
10
Graphical Models
MP(G) ∼ the set of all Markov properties obtained by ranging over the separating sets of G. P is represented by G ∼ P satisfies MP(G).
(Figure: a graph G(X) and candidate distributions P1(X), P2(X), P3(X), P4(X).)
11
Hammersley and Clifford Theorem
A positive P over X satisfies MP(G) iff P factorizes according to the cliques C of G:
P(X) = (1/Z) Π_{C∈C} ψC(XC)
A specific member of the family is specified by weights over the cliques.
(Figure: graph G(X) with distributions P1(X), …, P4(X); a particular P3(X) is singled out by its clique potentials {ψC^(3)(XC)}.)
12
Exponential Family
p(X) = (1/Z) Π_{C∈C} ψC(XC) = exp( Σ_{C∈C} log ψC(XC) − log Z )
Exponential family: p(X; θ) = exp( Σ_α θα φα(X) − Ψ(θ) )
⊲ {φα} ∼ features
⊲ {θα} ∼ parameters
⊲ Ψ(θ) ∼ log partition function
13
Inference
Answering queries about the probability distribution defined by the graphical model.
14
Inference
For an undirected model p(x; θ) = exp( Σ_{α∈I} θα φα(x) − Ψ(θ) ), the key inference problems are:
⊲ computing the log partition function (normalization constant) Ψ(θ)
⊲ marginals p(xA) = Σ_{xv, v∉A} p(x)
⊲ most probable configurations x∗ = arg max_x p(x | xL)
These problems are intractable in full generality.
15
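To make the three problems concrete, here is a minimal Python sketch (the 3-node chain, its parameter values, and the helper names are made-up illustrations, not from the slides) that computes Ψ(θ), a marginal, and the MAP configuration by brute-force enumeration, which is exactly what becomes infeasible as the model grows.

```python
import itertools
import numpy as np

# A tiny 3-node binary chain x1 - x2 - x3 (hypothetical example).
# p(x; theta) = exp( sum_s theta_s[x_s] + sum_(s,t) theta_st[x_s, x_t] - Psi(theta) )
theta_node = {0: np.array([0.0, 0.5]), 1: np.array([0.0, -0.3]), 2: np.array([0.0, 0.2])}
theta_edge = {(0, 1): np.array([[0.4, -0.4], [-0.4, 0.4]]),
              (1, 2): np.array([[0.4, -0.4], [-0.4, 0.4]])}

def score(x):
    """Unnormalized log-probability theta^T phi(x)."""
    s = sum(theta_node[i][x[i]] for i in theta_node)
    s += sum(theta_edge[e][x[e[0]], x[e[1]]] for e in theta_edge)
    return s

configs = list(itertools.product([0, 1], repeat=3))

# 1) log partition function Psi(theta): trivial here, intractable in general.
log_psi = np.log(sum(np.exp(score(x)) for x in configs))

# 2) marginal p(x1 = 1): sum p(x) over all configurations with x1 = 1.
p = {x: np.exp(score(x) - log_psi) for x in configs}
marg_x1 = sum(p[x] for x in configs if x[0] == 1)

# 3) most probable configuration x* = argmax_x p(x).
x_map = max(configs, key=score)

print("Psi(theta) =", log_psi)
print("p(x1 = 1)  =", marg_x1)
print("MAP configuration:", x_map)
```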
Log Partition Function
Ψ(θ) = log Z,  where Z = Σ_x Π_{α∈I} ψα(xα)
16
Variable Elimination
Z = Σ_x Π_{α∈I} ψα(xα)
Eliminate one variable xi at a time (Ci ∼ factors that involve xi; C∖i ∼ the rest):
Σ_x Π_α ψα(xα) = Σ_{xj, j≠i} Σ_{xi} Π_α ψα(xα)
= Σ_{xj, j≠i} Π_{α∈C∖i} ψα(xα) · Σ_{xi} Π_{α∈Ci} ψα(xα)
= Σ_{xj, j≠i} Π_{α∈C∖i} ψα(xα) · g(x_{j≠i})
17
Variable Elimination
Z = Σ_{xj, j≠i} Π_{α∈C∖i} ψα(xα) · g(x_{j≠i})
Continue to “eliminate” the other variables xj. Is this a linear-time method then?
18
Variable Elimination
Z = Σ_{xj, j≠i} Π_{α∈C∖i} ψα(xα) · g(x_{j≠i})
Continue to “eliminate” the other variables xj. Is this a linear-time method then? Not in general: g(x_{j≠i}) depends on all the variables xj that share a factor with xi.
19
Variable Elimination
Z = Σ_x Π_{α∈I} ψα(xα)
(Figure: tree over x1, …, x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ36, ψ46.)
20
Variable Elimination
Z = Σ_x Π_{α∈I} ψα(xα)
(Figure: after eliminating x1, …, x4, only x5, x6, x7 remain, with messages m5, m6 and factors ψ57, ψ67.)
22
Variable Elimination
Z = Σ_x Π_{α∈I} ψα(xα). Exponential in tree-width.
(Figure: tree over x1, …, x7 with pairwise factors ψ15, ψ25, ψ57, ψ67, ψ36, ψ46.)
Z = Σ_{x7} Σ_{x5} ψ57 ( Σ_{x1} ψ15 · Σ_{x2} ψ25 ) Σ_{x6} ψ67 ( Σ_{x3} ψ36 · Σ_{x4} ψ46 )
24
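For the 7-node tree above, the elimination order can be written out explicitly. The following sketch (random placeholder potentials and my own variable names) computes Z by eliminating the leaves first and checks the result against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 2  # binary variables x1..x7

# Pairwise potentials for the tree edges shown on the slide:
# (1,5), (2,5), (5,7), (6,7), (3,6), (4,6); 1-indexed as on the slide.
edges = [(1, 5), (2, 5), (5, 7), (6, 7), (3, 6), (4, 6)]
psi = {e: rng.uniform(0.5, 2.0, size=(K, K)) for e in edges}

def brute_force_Z():
    Z = 0.0
    for x in itertools.product(range(K), repeat=7):
        Z += np.prod([psi[(s, t)][x[s - 1], x[t - 1]] for (s, t) in edges])
    return Z

def eliminate_Z():
    # Eliminate x1, x2 into a message to x5; x3, x4 into a message to x6;
    # then x5 and x6 into messages to x7; finally sum over x7.
    m5 = psi[(1, 5)].sum(axis=0) * psi[(2, 5)].sum(axis=0)    # g(x5)
    m6 = psi[(3, 6)].sum(axis=0) * psi[(4, 6)].sum(axis=0)    # g(x6)
    m57 = (psi[(5, 7)] * m5[:, None]).sum(axis=0)             # sum over x5
    m67 = (psi[(6, 7)] * m6[:, None]).sum(axis=0)             # sum over x6
    return (m57 * m67).sum()                                  # sum over x7

print(brute_force_Z(), eliminate_Z())  # the two values should agree
```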
Inference
p(x; θ) = exp(θ⊤φ(x) − A(θ)) A(θ) ∼ log partition function
25
Inference
A(θ) = log Σ_x exp(θ⊤φ(x))
A(θ) ≤ B(θ, λ),  A(θ) ≥ C(θ, λ),  λ ∼ “variational” parameter
A(θ) ≤ inf_λ B(θ, λ),  A(θ) ≥ sup_λ C(θ, λ)
Summing over configurations → Optimization!
26
Inference
But... (there’s always a but!) is there a principled way to obtain “parametrized” bounds B(θ, λ) and C(θ, λ)?
27
Fenchel Duality
f(x) ∼ concave function. Define f∗(λ) = min_x {λ⊤x − f(x)}.
⇒ f′(xλ) = λ: the slope of the tangent at xλ is λ.
Tangent ∼ λ⊤x − (Intercept)  ⇒  f(xλ) = λ⊤xλ − (Intercept)
⇒ f∗(λ) ∼ the intercept of the line with slope λ that is tangent to f(x).
28
Fenchel Duality
Tangent ∼ λ⊤x − f∗(λ). So f(x) = min_λ {λ⊤x − f∗(λ)}. Thus, for any λ, g(x, λ) = λ⊤x − f∗(λ) is an upper bound on f(x)!
29
Fenchel Duality
Let us apply Fenchel duality to the log partition function! A(θ) ∼ convex.
A∗(µ) = sup_θ (θ⊤µ − A(θ))
A(θ) = sup_µ (θ⊤µ − A∗(µ))
30
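For intuition, here is a small numerical check (a single-variable example of my own, not from the slides) that A(θ) = sup_µ (θ⊤µ − A∗(µ)) for one binary variable with φ(x) = x, where A(θ) = log(1 + e^θ) and the conjugate A∗(µ) = µ log µ + (1 − µ) log(1 − µ) is the negative Bernoulli entropy.

```python
import numpy as np

theta = 1.3  # arbitrary natural parameter for a single binary variable

def A(theta):
    # log partition function: log sum_x exp(theta * x) for x in {0, 1}
    return np.log(1.0 + np.exp(theta))

def A_star(mu):
    # Fenchel conjugate = negative entropy of a Bernoulli(mu) variable
    return mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu)

# Maximize theta*mu - A*(mu) over a fine grid of mean parameters mu in (0, 1).
mus = np.linspace(1e-6, 1 - 1e-6, 200001)
values = theta * mus - A_star(mus)

print("A(theta)                  =", A(theta))
print("sup_mu theta*mu - A*(mu)  =", values.max())        # should match closely
print("maximizing mu             =", mus[values.argmax()],
      " vs Lambda(theta) =", 1.0 / (1.0 + np.exp(-theta)))
```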
Log Partition Function
Define the “marginal polytope” M = {µ ∈ Rd | ∃ p(·) s.t. Σ_x φ(x) p(x) = µ}.
M ∼ convex hull of {φ(x)}
31
Mean parameter mapping
Consider the mapping Λ : Θ → M, Λ(θ) := Eθ[φ(x)] = Σ_x φ(x) p(x; θ).
The mapping associates θ with the “mean parameters” µ := Λ(θ) ∈ M.
Conversely, for µ ∈ Int(M), ∃ θ = Λ−1(µ) (unique if the exponential family is minimal).
32
Partition function conjugates
A∗(µ) = sup_θ (θ⊤µ − A(θ))
A(θ) = sup_µ (θ⊤µ − A∗(µ))
The optimizing parameters are given by θµ = Λ−1(µ) and µθ = Λ(θ).
33
Partition function conjugate
Properties of the Fenchel conjugate A∗(µ) = sup_θ (θ⊤µ − A(θ)):
⊲ A∗(µ) is finite only for µ ∈ M.
⊲ A∗(µ) is the negative entropy of the graphical model distribution with “mean parameters” µ, or equivalently with parameters Λ−1(µ)!
34
Partition Function
A(θ) = sup_{µ∈M} ( θ⊤µ − A∗(µ) )
“Hardness” is due to two bottlenecks:
⊲ M: a polytope with exponentially many vertices and no compact representation
⊲ A∗(µ): entropy computation
Approximate either or both!
35
Pairwise Graphical Models
(Figure: two nodes x1, x2 joined by an edge carrying the potential θ12 φ12(x1, x2).)
Overcomplete potentials (indicator features):
Ij(xs) = 1 if xs = j, 0 otherwise
Ij,k(xs, xt) = 1 if xs = j and xt = k, 0 otherwise
p(x | θ) = exp( Σ_{s;j} θs;j Ij(xs) + Σ_{s,t;j,k} θs,t;j,k Ij,k(xs, xt) − Ψ(θ) )
36
Overcomplete Representation; Mean Parameters
µs;j := Eθ[Ij(xs)] = p(xs = j; θ)
µs,t;j,k := Eθ[Ij,k(xs, xt)] = p(xs = j, xt = k; θ)
Mean parameters are marginals! Define the following functional forms:
µs(xs) = Σ_j µs;j Ij(xs),  µst(xs, xt) = Σ_{j,k} µs,t;j,k Ij,k(xs, xt)
37
Outer Polytope Approximations
LOCAL(G) := {µ ≥ 0 | Σ_{xs} µs(xs) = 1,  Σ_{xt} µst(xs, xt) = µs(xs)}
38
Inner Polytope Approximations
For the given graph G and a subgraph H, let E(H) = {θ′ | θ′st = θst · 1{(s,t)∈H}}.
M(G; H) = {µ | µ = Eθ[φ(x)] for some θ ∈ E(H)}.  M(G; H) ⊆ M(G).
39
Entropy Approximations
Tree-structured distributions:
p(x; µ) = Π_s µs(xs) Π_{(s,t)∈E} µst(xs, xt) / (µs(xs) µt(xt))
Define:
Hs(µs) := −Σ_{xs} µs(xs) log µs(xs)
Ist(µst) := Σ_{xs,xt} µst(xs, xt) log [ µst(xs, xt) / (µs(xs) µt(xt)) ]
40
Tree-structured Entropy
A∗tree(µ) = −Σ_{s∈V} Hs(µs) + Σ_{(s,t)∈E} Ist(µst)
Compact representation; can be used as an approximation.
41
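As a sanity check of this decomposition (a small example of my own), the sketch below builds a 3-node chain, computes its exact entropy by enumeration, and compares it with −A∗tree(µ) = Σ_s Hs(µs) − Σ_{(s,t)} Ist(µst) evaluated from the node and edge marginals.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K = 2
edges = [(0, 1), (1, 2)]  # a 3-node chain (a tree), hypothetical potentials
psi = {e: rng.uniform(0.5, 2.0, size=(K, K)) for e in edges}

# Exact joint distribution by enumeration.
configs = list(itertools.product(range(K), repeat=3))
w = np.array([np.prod([psi[e][x[e[0]], x[e[1]]] for e in edges]) for x in configs])
p = w / w.sum()
exact_entropy = -np.sum(p * np.log(p))

# Node and edge marginals mu_s, mu_st.
mu_s = np.zeros((3, K))
mu_st = {e: np.zeros((K, K)) for e in edges}
for x, px in zip(configs, p):
    for s in range(3):
        mu_s[s, x[s]] += px
    for e in edges:
        mu_st[e][x[e[0]], x[e[1]]] += px

H = [-np.sum(mu_s[s] * np.log(mu_s[s])) for s in range(3)]          # node entropies
I = [np.sum(mu_st[e] * np.log(mu_st[e] / np.outer(mu_s[e[0]], mu_s[e[1]])))
     for e in edges]                                                 # mutual informations

tree_entropy = sum(H) - sum(I)       # = -A*_tree(mu); exact for a tree
print(exact_entropy, tree_entropy)   # the two should agree
```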
Approximate Inference Techniques
Belief Propagation – polytope ∼ LOCAL(G), entropy ∼ tree-structured entropy!
Structured Mean Field – polytope ∼ M(G; H), entropy ∼ H-structured entropy
Mean Field – H = H0, the completely disconnected graph (all variables independent)
42
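The slides do not spell out the message updates, so here is a rough sum-product belief propagation sketch on a small pairwise MRF (the graph, potentials, and iteration count are my own choices, not from the slides). On a tree it yields exact marginals; on loopy graphs its fixed points correspond to the LOCAL(G) plus tree-entropy (Bethe) approximation.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # a 4-cycle, so genuinely loopy
node_pot = rng.uniform(0.5, 2.0, size=(4, K))   # psi_s(x_s)
edge_pot = {e: rng.uniform(0.5, 2.0, size=(K, K)) for e in edges}

adj = {s: [] for s in range(4)}
for (a, b) in edges:
    adj[a].append(b)
    adj[b].append(a)

def pot(s, t):
    # edge potential as an [x_s, x_t] table regardless of stored orientation
    return edge_pot[(s, t)] if (s, t) in edge_pot else edge_pot[(t, s)].T

# Messages m[(s, t)](x_t), initialized uniform.
msgs = {}
for (a, b) in edges:
    msgs[(a, b)] = np.ones(K) / K
    msgs[(b, a)] = np.ones(K) / K

for _ in range(100):
    new = {}
    for (s, t) in msgs:
        # m_{s->t}(x_t) = sum_{x_s} psi_s(x_s) psi_st(x_s, x_t) prod_{u in N(s)\t} m_{u->s}(x_s)
        incoming = np.prod([msgs[(u, s)] for u in adj[s] if u != t], axis=0)
        m = (node_pot[s] * incoming) @ pot(s, t)
        new[(s, t)] = m / m.sum()
    msgs = new

# Beliefs (approximate marginals): b_s(x_s) ∝ psi_s(x_s) prod_{u in N(s)} m_{u->s}(x_s)
for s in range(4):
    b = node_pot[s] * np.prod([msgs[(u, s)] for u in adj[s]], axis=0)
    print("node", s, "belief:", b / b.sum())
```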
Divergence Measure View
Given: p(x; θ) ∝ exp(θ⊤φ(x)). We would like a more “manageable” surrogate distribution q ∈ Q:
min_{q∈Q} D(q(x) || p(x; θ))
43
Divergence Measure View
min_{q∈Q} D(q(x) || p(x; θ))
D(q||p) = KL(q||p) ∼ Structured Mean Field, Belief Propagation
D(p||q) = KL(p||q) ∼ Expectation Propagation (look out for the talk on Continuous Markov Random Fields!)
Typically the KL measure is approximated with “energy approximations” (Bethe free energy, Kikuchi free energy).
(Ravikumar, Lafferty 05; Preconditioner Approximations): optimizing a minimax criterion reduces the task to a generalized linear systems problem!
45
Bounds on event probabilities
Doctor: So what is the lower bound on the diagnosis probability?
Graphical Model: I don’t know, but here is an “approximate” value.
Doctor: =(
Can we get upper and lower bounds on p(X ∈ C; θ) (instead of just “approximate” values)?
46
Bounds on event probabilities
Classical Chernoff Bounds give useful estimates for i.i.d. random variables. Can they be extended to graphical models? [Ravikumar, Lafferty 04; Variational Chernoff Bounds]
47
Classical Chernoff Bounds
pθ(X ≥ u) ≤ Eθ[X] / u  (Markov inequality)
pθ(X ≥ u) = pθ(e^{λX} ≥ e^{λu}) ≤ Eθ[e^{λ(X−u)}]
From this it follows that: log pθ(X ≥ u) ≤ inf_{λ≥0} ( −λu + log Eθ[e^{λX}] )
Bounds on the cumulant function Eθ[e^{λX}] yield the standard Chernoff bounds.
48
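As a quick numerical illustration of the classical bound (the binomial example and the grid over λ are my own, not from the slides), the sketch below evaluates inf_{λ≥0} (−λu + log Eθ[e^{λX}]) for a sum of independent Bernoulli variables and compares it with the exact tail probability.

```python
import math
import numpy as np

n, q, u = 20, 0.3, 10        # X = sum of 20 i.i.d. Bernoulli(0.3); bound p(X >= 10)

# Exact tail probability, for comparison only.
exact = sum(math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(u, n + 1))

# Cumulant generating function of X: log E[e^{lambda X}] = n log(1 - q + q e^lambda).
lambdas = np.linspace(0.0, 5.0, 5001)
log_mgf = n * np.log(1.0 - q + q * np.exp(lambdas))
log_bound = np.min(-lambdas * u + log_mgf)   # inf over lambda >= 0 on a grid

print("exact  p(X >= u) =", exact)
print("Chernoff bound   =", np.exp(log_bound))   # larger, but a valid upper bound
```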
Generalized Chernoff Bounds
Event: X ∈ C
IC(X) = 1 if X ∈ C, 0 otherwise
IC(X) ≤ fλ(X)  ⇒  pθ(X ∈ C) ≤ Eθ[fλ(X)]
51
Graphical Model Chernoff Bounds
With fλ(x) = exp(⟨λ, x⟩ + u), we get:
log pθ(X ∈ C) ≤ inf_λ ( SC(−λ) + log Eθ[e^{⟨λ,x⟩}] )
where SC(λ) = sup_{x∈C} ⟨x, λ⟩ is the support function of the set C.
For an exponential model with sufficient statistic φ(x), the above becomes:
log pθ(X ∈ C) ≤ inf_λ ( SC,φ(−λ) + Φ(θ + λ) − Φ(θ) )
where Φ is the log-partition function and SC,φ(λ) = sup_{x∈C} ⟨φ(x), λ⟩.
53
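To make the bound concrete, the sketch below (the chain model, feature map, event C, and the crude random search over λ are all my own illustrative choices) evaluates log pθ(X ∈ C) ≤ inf_λ ( SC,φ(−λ) + Φ(θ+λ) − Φ(θ) ) on a model small enough that Φ and the support function can be computed by enumeration. Note that every λ yields a valid upper bound, so even a rough search over λ is legitimate.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Tiny 3-node binary chain, exponential family with (hypothetical) features
# phi(x) = (x1, x2, x3, x1*x2, x2*x3).
def phi(x):
    return np.array([x[0], x[1], x[2], x[0] * x[1], x[1] * x[2]], dtype=float)

theta = np.array([0.2, -0.1, 0.3, 0.8, 0.8])
configs = list(itertools.product([0, 1], repeat=3))
Phi = lambda t: np.log(sum(np.exp(t @ phi(x)) for x in configs))

# Event C: at least two of the variables are "on".
C = [x for x in configs if sum(x) >= 2]
exact = sum(np.exp(theta @ phi(x) - Phi(theta)) for x in C)

def S_C(lam):                  # support function sup_{x in C} <phi(x), lam>
    return max(phi(x) @ lam for x in C)

def bound(lam):                # log p(X in C) <= S_C(-lam) + Phi(theta+lam) - Phi(theta)
    return S_C(-lam) + Phi(theta + lam) - Phi(theta)

# Any lambda gives a valid bound; crude random search for a good one.
best = bound(np.zeros(5))      # lambda = 0 gives the trivial bound log 1 = 0
for _ in range(5000):
    best = min(best, bound(rng.normal(scale=2.0, size=5)))

print("exact p(X in C)       =", exact)
print("variational Chernoff  <=", np.exp(best))
```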
MAP Estimation
(Figure: candidate configurations of x1, x2, x3, x4, shown with probabilities PROB = 0.01 and PROB = 0.2.)
Most Probable Configuration?
58
Polytope View
µ∗ = max_x θ⊤φ(x) = sup_{µ∈M} θ⊤µ
59
Outer Polytope Relaxations
LOCAL(G) := {µ ≥ 0 | Σ_{xs} µs(xs) = 1,  Σ_{xt} µst(xs, xt) = µs(xs)}
µst(xs, xt) = µs(xs)}
60
Outer Polytope Relaxations
sup_{µ∈M(G)} θ⊤µ ≤ sup_{µ∈LOCAL(G)} θ⊤µ
A Linear Program!
(Chekuri, Khanna, Naor, Zosin 05; LP Formulation for Metric Labeling)
(Wainwright, Jaakkola, Willsky 05; Tree-reweighted Max-Product, Dual of LP)
61
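Here is a minimal sketch of this linear program for a single two-label edge, assuming scipy is available (the variable ordering and parameter values are my own). On a tree, such as a single edge, the relaxation is tight, so the LP value coincides with the brute-force MAP value.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Single edge (s, t), two labels; hypothetical potentials.
theta_1 = np.array([0.1, 0.6])
theta_2 = np.array([0.5, 0.0])
theta_12 = np.array([[1.0, -0.5],
                     [-0.5, 1.0]])

# Variable ordering:
# mu1(0), mu1(1), mu2(0), mu2(1), mu12(0,0), mu12(0,1), mu12(1,0), mu12(1,1)
c = -np.concatenate([theta_1, theta_2, theta_12.ravel()])   # linprog minimizes, so negate

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # sum_x1 mu1(x1) = 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # sum_x2 mu2(x2) = 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # sum_x2 mu12(0, x2) = mu1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # sum_x2 mu12(1, x2) = mu1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # sum_x1 mu12(x1, 0) = mu2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # sum_x1 mu12(x1, 1) = mu2(1)
], dtype=float)
b_eq = np.array([1, 1, 0, 0, 0, 0], dtype=float)

res = linprog(c, A_eq=A_eq, b_eq=b_eq)     # default bounds give mu >= 0
lp_value = -res.fun

# Brute-force MAP value for comparison; on a single edge the LP is tight.
brute = max(theta_1[x1] + theta_2[x2] + theta_12[x1, x2]
            for x1, x2 in itertools.product([0, 1], repeat=2))

print("LP relaxation value:", lp_value, " brute-force MAP value:", brute)
```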
Inner Polytope Approximations
If MI ⊂ M is any subset of the marginal polytope that includes all of its vertices,
µ∗ = max_x ⟨θ, φ(x)⟩ = sup_{µ∈MI} ⟨θ, µ⟩
62
Inner Polytope Approximations
For the given graph G and a subgraph H, let E(H) = {θ′ | θ′st = θst · 1{(s,t)∈H}}.
M(G; H) = {µ | µ = Eθ[φ(x)] for some θ ∈ E(H)}.  M(G; H) ⊆ M(G).
63
Inner Polytope Approximations
Mean Field parameters:
M(G; H0) = {µ(s; j), µ(s, j; t, k) | 0 ≤ µ(s; j) ≤ 1, µ(s, j; t, k) = µ(s; j) µ(t; k)}
Mean Field Relaxation:
sup_{µ∈M(G;H0)} ⟨θ, µ⟩ = sup_{µ∈M(G;H0)} [ Σ_{s;j} θs;j µ(s; j) + Σ_{st;jk} θs,j;t,k µ(s, j; t, k) ]
= sup_{µ∈M(G;H0)} [ Σ_{s;j} θs;j µ(s; j) + Σ_{st;jk} θs,j;t,k µ(s; j) µ(t; k) ]
A Quadratic Program! (Ravikumar, Lafferty 06; Quadratic Relaxations for Metric Labeling and MAP in MRFs)
64
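As one simple way to attack this quadratic program (the update rule, graph, and parameters below are my own illustrative choices, not the algorithm of the cited paper), a block-coordinate ascent can exploit the fact that, with all other nodes fixed, the objective is linear in µ(s; ·), so each block update can place all of its mass on a single best label.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3                                        # labels
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a small cycle; hypothetical model
theta_node = rng.normal(size=(4, K))
theta_edge = {e: rng.normal(size=(K, K)) for e in edges}

# Product-form (mean-field) marginals mu[s] over labels, initialized uniform.
mu = np.ones((4, K)) / K

def objective(mu):
    val = np.sum(theta_node * mu)
    for (s, t) in edges:
        val += mu[s] @ theta_edge[(s, t)] @ mu[t]
    return val

for sweep in range(20):
    for s in range(4):
        # Holding the other blocks fixed, the objective is linear in mu[s],
        # so the best feasible mu[s] puts all its mass on the best label.
        lin = theta_node[s].copy()
        for (a, b) in edges:
            if a == s:
                lin += theta_edge[(a, b)] @ mu[b]
            elif b == s:
                lin += theta_edge[(a, b)].T @ mu[a]
        mu[s] = np.eye(K)[np.argmax(lin)]

print("relaxation objective:", objective(mu))
print("labels:", mu.argmax(axis=1))
```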
References
⊲ Martin J. Wainwright and Michael I. Jordan (2003). Graphical models, exponential families, and variational inference.
⊲ M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). An introduction to variational methods for graphical models.
⊲ Pradeep Ravikumar, John Lafferty (2005). Preconditioner approximations for probabilistic graphical models.
⊲ Pradeep Ravikumar, John Lafferty (2004). Variational Chernoff bounds for graphical models.
⊲ C. Chekuri, S. Khanna, J. Naor, L. Zosin (2005). A linear programming formulation and approximation algorithms for the metric labeling problem.
⊲ Pradeep Ravikumar, John Lafferty (2006). Quadratic programming relaxations for metric labeling and Markov random field MAP estimation.
65