Probabilistic Graphical Models
Exponential family & Variational Inference I
Siamak Ravanbakhsh, Fall 2019

Learning objectives
- entropy
- exponential family distributions
- duality in exponential families
Information content

The information in observing X = x, written I(X = x), should satisfy:
- I(X = x) = 0 ⇔ P(X = x) = 1
- A = a ⊥ B = b ⇒ I(A = a, B = b) = I(A = a) + I(B = b)

These conditions lead to I(X = x) ≜ log( 1 / P(X = x) ) = − log(P(X = x))
Entropy: the expected information content

H(P) ≜ E[I(X)] = − Σ_{x∈Val(X)} P(X = x) log(P(X = x))

0 ≤ H(P) ≤ log(∣Val(X)∣)
and the maximum is achieved by the uniform distribution
Entropy as optimal expected code length, e.g., using Huffman coding

Val(X) = {a, b, c, d, e, f}
P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = 1/32, P(f) = 1/32

H(P) = − (1/2) log(1/2) − (1/4) log(1/4) − (1/8) log(1/8) − (1/16) log(1/16) − 2 · (1/32) log(1/32)
     = 1/2 + 1/2 + 3/8 + 1/4 + 5/16 = 31/16 bits   (log base 2)

an optimal code for transmitting X:
a → 0, b → 10, c → 110, d → 1110, e → 11110, f → 11111

average length? Σ_x P(x) · length(x); the contribution from X = a is (1/2) · 1, and the total is again 31/16 = H(P).
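As a sanity check, here is a minimal Python sketch (mine, not from the slides) computing both quantities for this example:

```python
# Check that the entropy of P equals the expected length of the code above.
import math

P = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/32, "f": 1/32}
code = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "11110", "f": "11111"}

# entropy in bits: H(P) = -sum_x P(x) log2 P(x)
H = -sum(p * math.log2(p) for p in P.values())

# average code length: sum_x P(x) * len(code(x))
avg_len = sum(P[x] * len(code[x]) for x in P)

print(H, avg_len)  # both print 1.9375 = 31/16 bits
```

Both values agree, as expected for a distribution whose probabilities are powers of 1/2.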
Cross entropy

For X ∼ p:
H(p, q) ≜ − Σ_{x∈Val(X)} p(x) log(q(x))
here − log(q(x)) is the optimal code length for X = x according to q, so H(p, q) is the expected code length when coding for q while X ∼ p.

Kullback-Leibler divergence (relative entropy):
D(p∥q) ≜ H(p, q) − H(p) = Σ_{x∈Val(X)} p(x) log( p(x) / q(x) )
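A minimal sketch of these definitions in Python (the helper names are my own, not from the slides), assuming distributions are given as aligned probability vectors with q(x) > 0:

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log2 q(x); terms with p(x) = 0 contribute 0
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    # D(p || q) = H(p, q) - H(p) >= 0, with equality iff p == q
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(kl(p, q), kl(q, p))  # note: KL is not symmetric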
Entropy in statistical physics

16 microstates: positions of 4 particles in a top/bottom box
5 macrostates: indistinguishable states, assuming exchangeable particles

With Val(X) = {top, bottom}, each macrostate is a distribution; we can assume 5 different distributions:
p(top) ∈ {0, 1/4, 1/2, 3/4, 1}

Which distribution is more likely?
Entropy of a macrostate: (normalized) log number of its microstates

H_macrostate = (1/N) ln( N! / (N_t! N_b!) ) = (1/N) ( ln(N!) − ln(N_t!) − ln(N_b!) )

Assume a large number of particles N, and use Stirling's approximation ln(N!) ≃ N ln(N) − N:

H_macrostate ≃ − (N_t/N) ln(N_t/N) − (N_b/N) ln(N_b/N)
             = − Σ_{x∈{top,bottom}} p(x) ln(p(x))    where p(top) = N_t/N

so the most likely macrostate (the one with the most microstates) is the distribution with maximum entropy.
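The Stirling step can be checked numerically; a short sketch (the particle counts below are arbitrary choices for illustration):

```python
# The normalized log count of microstates approaches the entropy
# -p ln p - (1-p) ln(1-p) as N grows.
import math

def macrostate_entropy(N, Nt):
    # (1/N) ln( N! / (Nt! (N - Nt)!) ), computed exactly via log-gamma
    return (math.lgamma(N + 1) - math.lgamma(Nt + 1)
            - math.lgamma(N - Nt + 1)) / N

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

for N in [4, 100, 10_000, 1_000_000]:
    Nt = N // 4                      # macrostate with p(top) = 1/4
    print(N, macrostate_entropy(N, Nt), binary_entropy(0.25))
```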
For continuous domains

Discretize Val(X) into bins of width Δ; then ∫_{iΔ}^{(i+1)Δ} p(x)dx = p(x_i)Δ for some x_i ∈ (iΔ, (i+1)Δ).

H ≈ − Σ_i p(x_i)Δ log( p(x_i)Δ ) = − Σ_i p(x_i)Δ log(p(x_i)) − log(Δ)

Ignoring the − log(Δ) term (which diverges as Δ → 0) gives the differential entropy
H(p) ≜ − ∫_{Val(X)} p(x) log(p(x)) dx
High entropy distribution:
- more information in observing X ∼ p
- it's a more likely "macrostate"
- the least amount of assumption about p

When optimizing for p(x) subject to constraints, maximize the entropy:

arg max_p H(p)
s.t.  E_p[ϕ_k(X)] = μ_k  ∀k
      p(x) > 0  ∀x
      ∫_{Val(X)} p(x)dx = 1

Using Lagrange multipliers, the solution has the form p(x) ∝ exp( Σ_k θ_k ϕ_k(x) )
The exponential family

p(x; θ) = h(x) exp( ⟨θ, ϕ(x)⟩ − A(θ) )
- h(x): base measure
- ϕ(x): sufficient statistics
- A(θ): log-partition function
A(θ) = ln( ∫_{Val(X)} h(x) exp( Σ_k θ_k ϕ_k(x) ) dx )
⟨θ, ϕ(x)⟩ = Σ_k θ_k ϕ_k(x) is the inner product of two vectors θ, ϕ(x) ∈ ℝ^D

Example: Gaussian
p(x; μ, σ²) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
sufficient statistics: ϕ(x) = [x, x²]
parameters: [μ/σ², −1/(2σ²)]
log-partition function: A = ½ ( ln(2πσ²) + μ²/σ² )
Natural parameters

Can we simply define η(θ) ∈ ℝ^D to be the new parameters θ?

For the Gaussian: θ = [θ₁, θ₂] = [μ/σ², −1/(2σ²)] with ϕ(x) = [x, x²], and
A(θ) = ½ ( ln(−π/θ₂) − θ₁²/(2θ₂) )

The base measure h(x) can be absorbed as a sufficient statistic ln h(x) with fixed θ = 1.
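A minimal sketch (mine, not from the slides) checking the natural-parameter form of the Gaussian against the usual density:

```python
# theta = (mu/sigma^2, -1/(2 sigma^2)), phi(x) = (x, x^2), h(x) = 1.
import math

def gaussian_natural(mu, var):
    return (mu / var, -1.0 / (2.0 * var))

def log_partition(t1, t2):
    # A(theta) = (1/2) ( ln(-pi / theta2) - theta1^2 / (2 theta2) )
    return 0.5 * (math.log(-math.pi / t2) - t1 ** 2 / (2.0 * t2))

def density(x, mu, var):
    t1, t2 = gaussian_natural(mu, var)
    return math.exp(t1 * x + t2 * x * x - log_partition(t1, t2))

x, mu, var = 0.7, 1.0, 2.0
usual = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
print(density(x, mu, var), usual)  # should match
```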
Example: Bernoulli, x ∈ {0, 1}
sufficient statistic: ϕ(x) = x
natural parameter: θ = ln( μ / (1 − μ) )
log-partition function: A(θ) = log(1 + e^θ)

Example: categorical, x ∈ {1, … , D}
sufficient statistics: ϕ(x) = [I(x = 1), … , I(x = D)]
natural parameters: θ_d = ln(μ_d)

Example: Beta
p(x; α, β) = ( Γ(α + β) / (Γ(α)Γ(β)) ) x^{α−1} (1 − x)^{β−1}
sufficient statistics: [ln(x), ln(1 − x)]
natural parameters: [α − 1, β − 1]
motivation: when discussing Bayesian inference
image: wikipedia
Example: Poisson
probability of x events happening in a fixed period;
events happen independently with rate λ (rate parameter);
similar to a binomial with a large number of trials (λ ≈ nμ)

p(x; λ) = λ^x e^{−λ} / x!
sufficient statistic: x
natural parameter: θ = ln(λ)
base measure: h(x) = 1/x!
log-partition function: A(θ) = exp(θ)
image: wikipedia
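A quick sketch (not from the slides) confirming that the exponential-family factorization reproduces the Poisson pmf:

```python
# h(x) = 1/x!, phi(x) = x, theta = ln(lambda), A(theta) = exp(theta)
import math

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, theta):
    h = 1.0 / math.factorial(x)              # base measure
    return h * math.exp(theta * x - math.exp(theta))

lam = 3.5
for x in range(5):
    print(poisson_pmf(x, lam), poisson_expfam(x, math.log(lam)))  # identical
```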
Example: exponential distribution
time between events in a Poisson process; memoryless property
p(x; λ) = λ e^{−λx},  Val(X) = ℝ⁺
sufficient statistic: x, natural parameter: θ = −λ
log-partition function: A(θ) = − ln(−θ)

Example: geometric distribution
number of Bernoulli trials until success; memoryless property
p(k; μ) = (1 − μ)^{k−1} μ  where 0 < μ < 1,  Val(X) = ℕ
related to the exponential distribution via (1 − μ) ≡ e^{−λ}
image: wikipedia
Example: Ising model
p(x; θ) ∝ exp( Σ_{i,j} θ_{i,j} x_i x_j )
for i = j this encodes the local field θ_{i,i} x_i
[figure: a 2D Ising grid; image: wainwright & jordan]
Example: mixture of Gaussians
p(x, y) = p(x) p(y ∣ x), with mixture component x ∈ {1, … , D} and y ∣ x = d ∼ N(μ_d, σ_d²)
sufficient statistics: products of [I(x = 1), … , I(x = D)] with [y, y²]
natural parameters for each component in the mixture: [μ_1/σ_1², −1/(2σ_1²), … , μ_D/σ_D², −1/(2σ_D²)]
More general forms

sufficient statistics ϕ_k defined over cliques D_k of the undirected graph:
A(θ) = ln( Σ_{x∈Val(X)} exp( − Σ_k θ_k ϕ_k(D_k) ) )
the familiar log-sum-exp form
image: Michael Jordan's draft
Example: sufficient statistics, natural parameters, and mean parameters for a pair of binary variables

sufficient statistics: I(X1 = 0, X2 = 0), I(X1 = 1, X2 = 0), I(X1 = 0, X2 = 1), I(X1 = 1, X2 = 1)
natural parameters: θ_{1,2,0,0}, θ_{1,2,1,0}, θ_{1,2,0,1}, θ_{1,2,1,1}
mean parameters:
μ_{1,2,0,0} = P(X1 = 0, X2 = 0)
μ_{1,2,1,0} = P(X1 = 1, X2 = 0)
μ_{1,2,0,1} = P(X1 = 0, X2 = 1)
μ_{1,2,1,1} = P(X1 = 1, X2 = 1)
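A brute-force sketch of the mapping from natural to mean parameters for this two-variable model (the θ values below are hypothetical, chosen only for illustration):

```python
import math

theta = {(0, 0): 0.3, (1, 0): -0.2, (0, 1): 0.5, (1, 1): 1.0}

# p(x1, x2) proportional to exp(theta[x1, x2]); A is the log normalizer
A = math.log(sum(math.exp(t) for t in theta.values()))
mu = {ab: math.exp(t - A) for ab, t in theta.items()}

print(mu)                # mean parameters: the pairwise marginals
print(sum(mu.values()))  # = 1
```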
Mean parameters

μ = E_p[ϕ(x)] for any distribution p (not necessarily of the exponential family form p_θ)

mean parameter space: M = { E_p[ϕ(x)] ∀p }
M is also convex

example (Gaussian): M is characterized by the constraint Σ − μμᵀ ⪰ 0
[figure: the sets Θ and M; image: wainwright & jordan]

example: X1, X2 ∈ {0, 1} with sufficient statistics I[X1 = 1], I[X2 = 1], I(X1 = 1, X2 = 1)
mean parameters: μ_1 = E[X1], μ_2 = E[X2], μ_{1,2} = E[X1 X2]
M = { E_p[ϕ(x)] ∀p } = conv{ (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1) }
the convex hull of the sufficient statistics over all x ∈ Val(X)
[figure: this polytope; image: wainwright & jordan]
For a discrete model p_θ with ϕ_k(x) = I(x_i = r, x_j = s), the mean parameters are the pairwise marginals μ_k = P(x_i = r, x_j = s).
Maximum likelihood as moment matching

given samples X1, X2, … , Xn ∼ p_θ:
empirical mean parameters: μ̂ = (1/n) Σ_{i=1}^n ϕ(X_i)
maximum likelihood: find θ s.t. E_{p_θ}[ϕ(x)] = μ̂

Projecting a distribution p onto a family Q:
I-projection: arg min_{q∈Q} D(q∥p)
M-projection: arg min_{q∈Q} D(p∥q)
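To make the moment-matching recipe concrete, a minimal sketch for the Bernoulli family (the sample size and seed are arbitrary choices of mine):

```python
# mu_hat is the empirical mean of the sufficient statistic phi(x) = x;
# the MLE natural parameter is the backward mapping theta = ln(mu/(1-mu)).
import math
import random

random.seed(0)
true_mu = 0.7
samples = [1 if random.random() < true_mu else 0 for _ in range(10_000)]

mu_hat = sum(samples) / len(samples)          # empirical mean parameter
theta_hat = math.log(mu_hat / (1 - mu_hat))   # solves E_{p_theta}[x] = mu_hat

print(mu_hat, theta_hat, math.log(true_mu / (1 - true_mu)))
```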
[figure: I-projection vs. M-projection of a bimodal p; the I-projection exhibits mode-seeking behavior, while the M-projection covers the modes]
M-projection onto fully factorized distributions

Q = { q : q(x) = ∏_k q(x_k) }, each factor q(x_k) otherwise unrestricted

D(p∥q) = E_p[ln p(x)] − Σ_k E_p[ln q(x_k)]
       = E_p[ ln( p(x) / ∏_k p(x_k) ) ] + Σ_k E_p[ ln( p(x_k) / q(x_k) ) ]
       = D(p∥q_M) + Σ_k D( p(x_k) ∥ q(x_k) )

where q_M(x) ≜ ∏_k p(x_k); each KL term is minimized by q(x_k) = p(x_k), so the M-projection matches the marginals of p.
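A short numerical check of this decomposition for two binary variables (a sketch assuming NumPy is available; the random distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2)); p /= p.sum()       # arbitrary joint p(x1, x2)
q1, q2 = rng.random(2), rng.random(2)
q1 /= q1.sum(); q2 /= q2.sum()
q = np.outer(q1, q2)                       # factorized q(x1) q(x2)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p1, p2 = p.sum(axis=1), p.sum(axis=0)      # marginals of p
qM = np.outer(p1, p2)                      # the M-projection

lhs = kl(p, q)
rhs = kl(p, qM) + kl(p1, q1) + kl(p2, q2)
print(lhs, rhs)                            # equal
```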
M-projection onto an exponential family Q = {q_θ}

consider two distributions:
q_θ has the same moments as p: E_{q_θ}[ϕ(x)] = E_p[ϕ(x)]
q_θ′ has different moments

D(p∥q_θ′) − D(p∥q_θ) = ⟨E_p[ϕ(x)], θ − θ′⟩ − A(θ) + A(θ′)
                     = ⟨E_{q_θ}[ϕ(x)], θ − θ′⟩ − A(θ) + A(θ′)
                     = D(q_θ∥q_θ′) ≥ 0

so q_θ is the projection (note that p can have any form)
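The identity above can be checked numerically. A sketch for the one-parameter family q_θ(x) ∝ exp(θx) on {0, 1, 2}, assuming NumPy and SciPy are available (the choice of p and θ′ is arbitrary):

```python
import numpy as np
from scipy.optimize import brentq

xs = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.2, 0.3])        # arbitrary p, not in the family

def q(theta):
    w = np.exp(theta * xs)
    return w / w.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# moment matching: find theta with E_{q_theta}[x] = E_p[x]
m = float(p @ xs)
theta = brentq(lambda t: q(t) @ xs - m, -10, 10)
theta_p = 1.3                        # any other member of the family

# D(p||q_theta') - D(p||q_theta) = D(q_theta||q_theta')
print(kl(p, q(theta_p)) - kl(p, q(theta)), kl(q(theta), q(theta_p)))
```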
Variational formulations of the projections

I-projection: arg min_{q∈Q} D(q∥p) = arg min_{q∈Q} E_q[− ln(p)] − H(q)    (energy − entropy)

exponential family form:
A(θ) = max_{μ∈M} ⟨μ, θ⟩ − A*(μ)
⟨μ, θ⟩ is the negative energy and A*(μ) is the negative entropy, where
A*(μ) = max_{θ∈Θ} ⟨μ, θ⟩ − A(θ)    (the convex conjugate of A)

M-projection: arg min_{q∈Q} D(p∥q) = arg min_{q∈Q} E_p[− ln(q)]    (expected negative log-likelihood; aka moment matching)
ideas based on moment matching are also applied to inference

but we saw that M-projection gives correct marginals, why use I-projection?
(computing the M-projection requires the moments of p, i.e., inference in p itself)

forward mapping: θ ↦ μ = ∫ p_θ(x) ϕ(x) dx, from Θ to M
backward mapping: μ ↦ θ(μ), from M to Θ
[figure: the mappings between Θ and M; image: wainwright & jordan]
Example: duality for the Bernoulli

Θ = ℝ,  A(θ) = log(1 + exp(θ))

forward mapping: ∇_θ A(θ) = exp(θ) / (1 + exp(θ)) = μ, the mean parameter
backward mapping: θ*(μ) = ln(μ) − ln(1 − μ)
conjugate dual: A*(μ) = μ ln(μ) + (1 − μ) ln(1 − μ), the negative entropy
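A minimal sketch (not from the slides) verifying the conjugate duality A(θ) = max_{μ∈M} ⟨μ, θ⟩ − A*(μ) for the Bernoulli by grid search:

```python
import math

def A(theta):
    return math.log(1 + math.exp(theta))

def A_star(mu):
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

theta = 0.8
mus = [i / 10_000 for i in range(1, 10_000)]   # grid over M = (0, 1)
best = max(mu * theta - A_star(mu) for mu in mus)
print(best, A(theta))                          # essentially equal

# the maximizer is the forward mapping mu = grad A(theta) = sigmoid(theta)
mu_star = max(mus, key=lambda mu: mu * theta - A_star(mu))
print(mu_star, 1 / (1 + math.exp(-theta)))
```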
KL divergence within the exponential family

D(q_{θ1}∥q_{θ2}) = A(θ2) − A(θ1) − ⟨μ1, θ2 − θ1⟩, where μ1 = ∇_θ A(θ1) ∈ M
it depends on q_{θ1} only through μ1 (not on θ1 directly) and takes the form of a Bregman divergence for the convex function A
[figure: the Bregman divergence as the gap between A(θ2) and the tangent of A at θ1; image: wainwright & jordan]

recovering θ from μ is a familiar optimization! (the moment-matching problem behind maximum likelihood and the mapping θ → μ)
the KL divergence is not symmetric: which one to use? is this the "right" one?
Variational inference via the forward mapping

A(θ) = max_{μ∈M} ⟨θ, μ⟩ − A*(μ)
the maximizing μ is the forward mapping μ = ∇_θ A(θ), from Θ to M
e.g., it gives us the marginals in the Ising model
[figure: the forward mapping ∇_θ A(θ) from Θ to M; image: wainwright & jordan]
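For a model small enough to enumerate, the forward mapping can be computed by brute force. A sketch for a tiny Ising-style grid, assuming NumPy is available (the 2x2 grid, random parameters, and seed are choices of mine for illustration):

```python
# grad A(theta) yields the marginals mu_i = P(x_i = 1) and
# mu_ij = P(x_i = 1, x_j = 1); here we compute them by enumeration.
import itertools
import numpy as np

n = 4                                       # 2x2 grid, variables 0..3
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
rng = np.random.default_rng(0)
theta_node = rng.normal(size=n)             # local fields
theta_edge = rng.normal(size=len(edges))    # couplings

# enumerate all 2^n states; p(x) ∝ exp(sum_i th_i x_i + sum_ij th_ij x_i x_j)
states = np.array(list(itertools.product([0, 1], repeat=n)))
scores = states @ theta_node + np.array(
    [sum(t * x[i] * x[j] for t, (i, j) in zip(theta_edge, edges)) for x in states])
p = np.exp(scores - scores.max())
p /= p.sum()

mu_node = p @ states                                             # P(x_i = 1)
mu_edge = [float(p @ (states[:, i] * states[:, j])) for i, j in edges]
print(mu_node, mu_edge)
```

Exact enumeration is exponential in n, which is exactly why the variational view of A(θ) matters for larger models.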