Undirected Graphical Models
Aaron Courville, Université de Montréal
2
(UNDIRECTED) GRAPHICAL MODELS
Overview: directed versus undirected graphical models; conditional independence; the energy function formalism; maximum likelihood learning.
3
Graphical models combine probability theory and graph theory: the graph encodes conditional independence assumptions about the joint distribution of the set of nodes/variables, and that structure (and the conditional independence it implies) is useful for both representation and inference.
Probability theory + graph theory = probabilistic graphical models.
4
5
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z; that is, for all (i, j, k),
P(X = xi, Y = yj | Z = zk) = P(X = xi | Z = zk) P(Y = yj | Z = zk)
i.e. P(X, Y | Z) = P(X | Z) P(Y | Z).
Or equivalently (by the product rule): P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z).
Why? Recall the product rule: P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z).
Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning).
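As a quick numerical illustration of this definition (not from the slides; the conditional tables below are made up), one can build a toy joint over three binary variables and check that P(X, Y | Z) = P(X | Z) P(Y | Z):

```python
import numpy as np

# Toy joint P(X, Y, Z) over binary variables, constructed so that X ⊥ Y | Z.
# The tables are illustrative numbers only.
p_z = np.array([0.3, 0.7])                 # P(Z)
p_x_given_z = np.array([[0.9, 0.1],        # P(X | Z=0)
                        [0.2, 0.8]])       # P(X | Z=1)
p_y_given_z = np.array([[0.6, 0.4],
                        [0.5, 0.5]])

# Joint: P(X=i, Y=j, Z=k) = P(X=i|Z=k) P(Y=j|Z=k) P(Z=k)
joint = np.einsum('ki,kj,k->ijk', p_x_given_z, p_y_given_z, p_z)

# Check P(X, Y | Z) == P(X | Z) P(Y | Z) for every value of Z.
p_xy_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)
for k in range(2):
    lhs = p_xy_given_z[:, :, k]
    rhs = np.outer(p_x_given_z[k], p_y_given_z[k])
    assert np.allclose(lhs, rhs)
```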
6
[Figure: Venn diagram — probabilistic models contain graphical models, which divide into directed and undirected models.]
7
Some conditional independencies cannot be represented by directed graphical models:
[Figure: candidate directed graphs over A, B, C, D and the independencies each implies — (A ⊥ C | B, D); (A ⊥ C | B, D) with (B ⊥ D | A); (A ⊥ C) with (B ⊥ D | A, C) — versus the undirected model (the 4-cycle A–B–C–D–A), which captures exactly (A ⊥ C | B, D) and (B ⊥ D | A, C).]
8
Sometimes it is awkward to model phenomena with directed models.
Image from “CRF as RNN Semantic Image Segmentation Live Demo” (http://www.robots.ox.ac.uk/~szheng/crfasrnndemo/)
[Figure: image pixels X11–X45 arranged as a grid-structured (lattice) Markov random field.]
9
Conditional independence from graph separation: xA ⊥ xB | xC whenever the set of nodes C separates A and B in the graph, i.e. every path from A to B in the graph passes through C.
[Figure: grid example with node sets A, C, B, where C separates A from B.]
10
Markov blanket: in an undirected graph, the Markov blanket of a node x is its set of neighbours, which renders x conditionally independent of all other nodes in the graph.
[Figure: grid with a node and its neighbours (its Markov blanket) highlighted.]
11
[Figure: Markov blankets on the grid — in the undirected model, the Markov blanket of X23 is its neighbours; in the directed model, it is the parents of X23, the children of X23, and the parents of the children of X23.]
12
Directed graphical models:
[Figure: two-node example with conditional factor P(A | B).]
The joint distribution factorizes as a product of local conditional distributions, one per node given its parents:
P(X1, . . . , XN) = ∏_{i=1..N} P(Xi | Xparents(i))
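As a small illustration of this factorization (not from the slides; the conditional probability tables are made up), the joint over a three-node chain is just the product of its CPTs and is automatically normalized:

```python
import numpy as np

# Toy 3-node chain X1 -> X2 -> X3 with binary variables; CPT values are illustrative only.
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.3],
                          [0.2, 0.8]])   # rows indexed by the value of X1
p_x3_given_x2 = np.array([[0.9, 0.1],
                          [0.5, 0.5]])   # rows indexed by the value of X2

# P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X2)
joint = np.einsum('a,ab,bc->abc', p_x1, p_x2_given_x1, p_x3_given_x2)
assert np.isclose(joint.sum(), 1.0)      # a product of normalized CPTs needs no partition function
```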
13
Undirected graphical models:
For each clique c ∈ C (the set of cliques of the graph), define a factor φc (also called a potential function or clique potential) as a nonnegative function φc(xc) → R, where xc is the set of variables in clique c.
[Figure: two-node example with potential φ(A, B).]
PARAMETERIZING MARKOV NETWORKS: JOINT DISTRIBUTION
14
[Figure: 4-cycle Markov network over A, B, C, D.]
P(a, b, c, d) = (1/Z) φ1(a, b) φ2(b, c) φ3(c, d) φ4(d, a),   Z = ∑_{a,b,c,d} φ1(a, b) φ2(b, c) φ3(c, d) φ4(d, a)
In general:
P(x1, . . . , xn) = (1/Z) ∏_{c ∈ C} φc(xc),   Z = ∑_{x1, ..., xn} ∏_{c ∈ C} φc(xc)
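For intuition (an illustration, not part of the slides), the sketch below parameterizes the 4-cycle with arbitrary nonnegative potential tables over binary variables and computes Z by brute-force enumeration:

```python
import itertools
import numpy as np

# Four binary variables on a cycle a-b-c-d-a; potential tables are arbitrary
# nonnegative numbers chosen only to illustrate the normalization.
phi1 = np.array([[1.0, 2.0], [3.0, 1.0]])   # phi1(a, b)
phi2 = np.array([[2.0, 1.0], [1.0, 4.0]])   # phi2(b, c)
phi3 = np.array([[1.5, 1.0], [0.5, 2.0]])   # phi3(c, d)
phi4 = np.array([[1.0, 0.5], [2.0, 1.0]])   # phi4(d, a)

def unnormalized(a, b, c, d):
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Brute-force partition function: sum over all 2^4 joint configurations.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))

def joint(a, b, c, d):
    return unnormalized(a, b, c, d) / Z

assert np.isclose(sum(joint(*x) for x in itertools.product([0, 1], repeat=4)), 1.0)
```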
CLIQUES AND MAXIMAL CLIQUES
15
A clique is a fully connected subset of nodes; a maximal clique is a clique that cannot be extended by adding any other node of the graph.
[Figure: graph over A, B, C, D with examples of maximal cliques highlighted.]
OF GRAPHS AND DISTRIBUTIONS
16
Any strictly positive distribution whose conditional independencies are represented with an undirected graph can be parameterized by a product of factors over the (maximal) cliques of that graph (Hammersley–Clifford theorem).
[Figure: graph over A, B, C, D with examples of maximal cliques highlighted.]
17
[Figure: Venn diagram — probabilistic models contain graphical models, split into directed and undirected families.]
RELATING DIRECTED AND UNDIRECTED MODELS
18
Which probability models can be represented as both a directed and an undirected graphical model?
➡ Answer: any probability model whose conditional independence relations are consistent with a chordal graph (every cycle of length four or more has a chord).
[Figure: the 4-cycle over A, B, C, D is not chordal; adding a chord makes it chordal.]
19
[Figure: Venn diagram — within graphical models, the chordal models form the intersection of the directed and undirected families.]
20
Energy function formalism: since each potential φc(xc) must be nonnegative, we can write
φc(xc) = exp(−Ec(xc)),   Ec(xc) = −log φc(xc) ∈ R.
Then
P(x1, . . . , xn) = (1/Z) exp(−E(x1, . . . , xn)) = (1/Z) exp(−∑_{c ∈ C} Ec(xc)),
Z = ∑_{x1} · · · ∑_{xn} exp[−E(x1, . . . , xn)],
where E(x1, . . . , xn) = ∑_{c ∈ C} Ec(xc) is called the energy function. This is the energy-based (Boltzmann distribution) parametrization.
21
Log-linear parametrization: each clique energy function is composed of a feature function fc(xc) and a weight wc, with Ec(xc) = −wc fc(xc), so that
P(x1, . . . , xn) = (1/Z) exp(∑_{c ∈ C} wc fc(xc)).
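A minimal sketch (illustrative numbers only) of the fact that any tabular potential can be written in this log-linear form, using one indicator feature per configuration of a clique with weight equal to the log-potential:

```python
import itertools
import numpy as np

# One pairwise clique potential over binary (a, b); table values are arbitrary.
phi = np.array([[1.0, 2.0],
                [3.0, 0.5]])

# Energy form: E(a, b) = -log phi(a, b).
E = -np.log(phi)

# Log-linear form: one indicator feature per cell with weight w_ab = log phi(a, b),
# so that sum_ab w_ab * f_ab(x) recovers log phi(a, b) = -E(a, b) at the observed configuration.
w = np.log(phi)
for a, b in itertools.product([0, 1], repeat=2):
    features = np.zeros((2, 2))
    features[a, b] = 1.0                      # indicator feature for this configuration
    assert np.isclose((w * features).sum(), -E[a, b])
```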
22
Maximum likelihood learning: the clique (data) term decomposes over the cliques; the partition function term does not decompose.

wML = argmax_w ∑_{x^(i) ∈ D} log p(x^(i); w)
    = argmax_w ∑_{x^(i) ∈ D} [ ∑_{c ∈ C} log φc(x^(i)_c; wc) − log Z(w) ]

With the log-linear parametrization, log φc(x^(i)_c; wc) = wc fc(x^(i)_c), so the first term is ∑_{x^(i) ∈ D} ∑_{c ∈ C} wc fc(x^(i)_c), which decomposes over cliques; the −log Z(w) term does not.
23
Gradient of the log partition function:

log Z(w) = log ∑_x exp(∑_c wc fc(xc))

∂/∂wc log Z(w) = ∂/∂wc log ∑_x exp(∑_{c'} wc' fc'(xc'))
             = (1/Z(w)) ∑_x fc(xc) exp(∑_{c'} wc' fc'(xc'))
             = E_{p(xc; wc)} [fc(xc)]
24
∂/∂wc ∑_{x^(i) ∈ D} log p(x^(i); w) = ∂/∂wc ∑_{x^(i) ∈ D} ∑_c wc fc(x^(i)_c) − |D| ∂/∂wc log Z(w)
  = ∑_{x^(i) ∈ D} fc(x^(i)_c) − |D| E_{p(xc; wc)} [fc(xc)]
  = |D| E_{p(data)} [fc(xc)] − |D| E_{p(xc; wc)} [fc(xc)]

The first term is the data term (an empirical expectation over the training set, e.g. with fully observable x); the second is the model term (an expectation under the current model).
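A small sketch of this gradient (illustrative only): a tiny log-linear MRF over three binary variables with two pairwise cliques, small enough that both expectations can be computed by exhaustive enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Pairwise indicator features on cliques (0, 1) and (1, 2); w[c] is a 2x2 weight table.
cliques = [(0, 1), (1, 2)]
w = [rng.normal(size=(2, 2)) for _ in cliques]

def log_unnorm(x):
    return sum(w[c][x[i], x[j]] for c, (i, j) in enumerate(cliques))

configs = list(itertools.product([0, 1], repeat=3))
logZ = np.log(sum(np.exp(log_unnorm(x)) for x in configs))
p = {x: np.exp(log_unnorm(x) - logZ) for x in configs}

data = [(0, 0, 1), (1, 1, 1), (0, 1, 1)]   # made-up "training" configurations

# Gradient of the average log-likelihood: data expectation minus model expectation.
grad = []
for c, (i, j) in enumerate(cliques):
    data_expect = np.zeros((2, 2))
    for x in data:
        data_expect[x[i], x[j]] += 1.0 / len(data)
    model_expect = np.zeros((2, 2))
    for x in configs:
        model_expect[x[i], x[j]] += p[x]
    grad.append(data_expect - model_expect)
```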
25
How do we deal with the partition function's contribution to the gradient,
∂/∂wc log Z(w) = E_{p(xc; wc)} [fc(xc)]?
➡ Approximate the expectation with samples from the model (e.g. MCMC). This comes with some disadvantages; more on this when we discuss restricted Boltzmann machines.
Restricted Boltzmann Machines: An Introduction
27
Topics: RBM, visible layer, hidden layer, energy function
[Figure: bipartite graph with hidden layer h (binary units) and visible layer x (binary units), connections W, hidden biases bj, visible biases ck.]
Energy function:
E(x, h) = −hᵀWx − cᵀx − bᵀh = −∑_j ∑_k Wj,k hj xk − ∑_k ck xk − ∑_j bj hj
Distribution: p(x, h) = exp(−E(x, h))/Z, where Z is the partition function (intractable).
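A minimal sketch of this energy function in code (shapes and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 4                              # numbers of visible and hidden units (illustrative)
W = rng.normal(scale=0.1, size=(H, D))   # connection weights
b = np.zeros(H)                          # hidden biases b_j
c = np.zeros(D)                          # visible biases c_k

def energy(x, h):
    """E(x, h) = -h^T W x - c^T x - b^T h for binary vectors x (D,) and h (H,)."""
    return -h @ W @ x - c @ x - b @ h

x = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=H).astype(float)
log_p_unnormalized = -energy(x, h)       # log p(x, h) up to the intractable log Z
```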
28
Topics: Markov network (with vector nodes)
The RBM can be viewed as a Markov network over two vector-valued nodes x and h, an alternative to the representation as a product of scalar factors:
p(x, h) = exp(−E(x, h))/Z = exp(hᵀWx + cᵀx + bᵀh)/Z = exp(hᵀWx) exp(cᵀx) exp(bᵀh)/Z
[Figure: the two vector nodes h and x and their factors.]
29
Topics: Markov network (with scalar nodes)
Equivalently, the RBM is a Markov network over the scalar variables within the vectors, h1, . . . , hH and x1, . . . , xD:
p(x, h) = (1/Z) ∏_j ∏_k exp(Wj,k hj xk) · ∏_k exp(ck xk) · ∏_j exp(bj hj)
The first product gives the pair-wise factors; the last two give the unary factors.
[Figure: bipartite graph over the scalar nodes h1, . . . , hH and x1, . . . , xD.]
30
(Recap of the RBM energy function and distribution: E(x, h) = −hᵀWx − cᵀx − bᵀh and p(x, h) = exp(−E(x, h))/Z.)
31
Topics: conditional distributions
Because the RBM is bipartite, the conditionals factorize over the units:
p(h|x) = ∏_j p(hj|x),   p(x|h) = ∏_k p(xk|h)
with
p(hj = 1|x) = 1 / (1 + exp(−(bj + Wj·x))) = sigm(bj + Wj·x)      (Wj·: j-th row of W)
p(xk = 1|h) = 1 / (1 + exp(−(ck + hᵀW·k))) = sigm(ck + hᵀW·k)    (W·k: k-th column of W)
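A short sketch of sampling from these conditionals (assuming the W, b, c arrays from the earlier energy sketch; everything here is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x, W, b, rng):
    """Sample h ~ p(h | x): each h_j is Bernoulli(sigm(b_j + W_{j.} x))."""
    p = sigmoid(b + W @ x)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_h(h, W, c, rng):
    """Sample x ~ p(x | h): each x_k is Bernoulli(sigm(c_k + h^T W_{.k}))."""
    p = sigmoid(c + W.T @ h)
    return (rng.random(p.shape) < p).astype(float), p
```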
32
Derivation of the factorization of p(h|x):

p(h|x) = p(x, h) / ∑_{h'} p(x, h')
       = exp(hᵀWx + cᵀx + bᵀh)/Z  /  ∑_{h' ∈ {0,1}^H} exp(h'ᵀWx + cᵀx + bᵀh')/Z
       = exp(∑_j hj Wj·x + bj hj)  /  ∑_{h'1 ∈ {0,1}} · · · ∑_{h'H ∈ {0,1}} exp(∑_j h'j Wj·x + bj h'j)
       = ∏_j exp(hj Wj·x + bj hj)  /  ∑_{h'1 ∈ {0,1}} · · · ∑_{h'H ∈ {0,1}} ∏_j exp(h'j Wj·x + bj h'j)
       = ∏_j exp(hj Wj·x + bj hj)  /  [ (∑_{h'1 ∈ {0,1}} exp(h'1 W1·x + b1 h'1)) · · · (∑_{h'H ∈ {0,1}} exp(h'H WH·x + bH h'H)) ]
       = ∏_j exp(hj Wj·x + bj hj)  /  ∏_j (∑_{h'j ∈ {0,1}} exp(h'j Wj·x + bj h'j))
       = ∏_j exp(hj Wj·x + bj hj)  /  ∏_j (1 + exp(bj + Wj·x))
       = ∏_j [ exp(hj Wj·x + bj hj) / (1 + exp(bj + Wj·x)) ]
       = ∏_j p(hj|x)
33
In particular, p(hj = 1|x) = exp(bj + Wj·x) / (1 + exp(bj + Wj·x)) = 1 / (1 + exp(−bj − Wj·x)) = sigm(bj + Wj·x).
34
Topics: free energy
Marginalizing out the hidden units gives the free energy F(x):
p(x) = ∑_{h ∈ {0,1}^H} p(x, h) = ∑_{h ∈ {0,1}^H} exp(−E(x, h))/Z
     = exp( cᵀx + ∑_{j=1}^H log(1 + exp(bj + Wj·x)) ) / Z
     = exp(−F(x)) / Z
[Figure: RBM with the hidden layer h marginalized out, leaving a distribution over x alone.]
35
Derivation:
p(x) = ∑_{h ∈ {0,1}^H} exp(hᵀWx + cᵀx + bᵀh)/Z
     = exp(cᵀx) ∑_{h1 ∈ {0,1}} · · · ∑_{hH ∈ {0,1}} exp(∑_j hj Wj·x + bj hj) / Z
     = exp(cᵀx) (∑_{h1 ∈ {0,1}} exp(h1 W1·x + b1 h1)) · · · (∑_{hH ∈ {0,1}} exp(hH WH·x + bH hH)) / Z
     = exp(cᵀx) (1 + exp(b1 + W1·x)) · · · (1 + exp(bH + WH·x)) / Z
     = exp(cᵀx) exp(log(1 + exp(b1 + W1·x))) · · · exp(log(1 + exp(bH + WH·x))) / Z
     = exp( cᵀx + ∑_{j=1}^H log(1 + exp(bj + Wj·x)) ) / Z
36
Topics: free energy
[Figure: plot of the softplus(·) function.]
p(x) = exp( cᵀx + ∑_{j=1}^H log(1 + exp(bj + Wj·x)) ) / Z
     = exp( cᵀx + ∑_{j=1}^H softplus(bj + Wj·x) ) / Z
Interpretation: each row Wj· acts as a "feature" expected in x, the visible biases c bias the probability of each xi, and bj biases each feature.
37
Topics: training objective
We minimize the average negative log-likelihood (NLL):
(1/T) ∑_t l(f(x^(t))) = (1/T) ∑_t −log p(x^(t))
Its gradient splits into two phases:
∂(−log p(x^(t)))/∂θ = E_h[ ∂E(x^(t), h)/∂θ | x^(t) ] − E_{x,h}[ ∂E(x, h)/∂θ ]
The first term is the positive phase (clamped on the training example); the second is the negative phase, which is hard to compute because it is an expectation under the model.
(HINTON, NEURAL COMPUTATION, 2002)
38
Topics: contrastive divergence, negative sample
Idea: approximate the negative-phase expectation with a single "negative sample" x̃, obtained by a short Gibbs chain started at the training example: x^1 → · · · → x^k = x̃, alternating h ∼ p(h|x) and x ∼ p(x|h).
[Figure: Gibbs chain producing the negative sample x̃.]
(HINTON, NEURAL COMPUTATION, 2002)
39
Topics: contrastive divergence, negative sample
Contrastive divergence replaces the expectations of the energy gradient by point estimates at the negative sample (x̃, h̃):
E_h[ ∂E(x^(t), h)/∂θ | x^(t) ] ≈ ∂E(x^(t), h̃^(t))/∂θ,    E_{x,h}[ ∂E(x, h)/∂θ ] ≈ ∂E(x̃, h̃)/∂θ
(HINTON, NEURAL COMPUTATION, 2002)
40
Topics: contrastive divergence, negative sample
The negative sample (x̃, h̃) acts as an approximate sample from the model distribution p(x, h) in the point estimates above.
41
(Recap of the training objective: the NLL gradient is the positive phase, E_h[∂E(x^(t), h)/∂θ | x^(t)], minus the negative phase, E_{x,h}[∂E(x, h)/∂θ], which is hard to compute.)
42
Topics: contrastive divergence
We need ∂E(x, h)/∂θ for each parameter. For θ = Wjk:
∂E(x, h)/∂Wjk = ∂/∂Wjk [ −∑_{j,k} Wjk hj xk − ∑_k ck xk − ∑_j bj hj ]
             = −∂/∂Wjk (Wjk hj xk) = −hj xk
43
Topics: contrastive divergence
We also need the conditional expectation E_h[∂E(x, h)/∂θ | x]. For θ = Wjk:
E_h[ ∂E(x, h)/∂Wjk | x ] = E_h[ −hj xk | x ] = ∑_{hj ∈ {0,1}} −hj xk p(hj|x) = −xk p(hj = 1|x)
Defining h(x) := ( p(h1 = 1|x), . . . , p(hH = 1|x) )ᵀ = sigm(b + Wx), in matrix form:
E_h[ ∇W E(x, h) | x ] = −h(x) xᵀ
44
Topics: contrastive divergence
Stochastic gradient update for θ = W, using the negative sample x̃:
W ⇐ W + α ∇W log p(x^(t))
  = W − α ( E_h[ ∇W E(x^(t), h) | x^(t) ] − E_{x,h}[ ∇W E(x, h) ] )
  ≈ W − α ( E_h[ ∇W E(x^(t), h) | x^(t) ] − E_h[ ∇W E(x̃, h) | x̃ ] )
  = W + α ( h(x^(t)) x^(t)ᵀ − h(x̃) x̃ᵀ )
45
Topics: contrastive divergence
CD-k: the negative sample x̃ is obtained by k steps of Gibbs sampling, starting at x^(t). The parameter updates are:
W ⇐ W + α ( h(x^(t)) x^(t)ᵀ − h(x̃) x̃ᵀ )
b ⇐ b + α ( h(x^(t)) − h(x̃) )
c ⇐ c + α ( x^(t) − x̃ )
(HINTON, NEURAL COMPUTATION, 2002)
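A self-contained sketch of one CD-k update implementing these rules (function names, shapes, and hyperparameters are my own and purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(x_t, W, b, c, k=1, alpha=0.01, rng=None):
    """One CD-k update on a single binary training vector x_t (float array).

    Implements the slide's updates: W += alpha (h(x) x^T - h(x~) x~^T), and
    similarly for b and c, with x~ obtained by k steps of Gibbs sampling from x_t.
    """
    rng = rng or np.random.default_rng()
    h_data = sigmoid(b + W @ x_t)          # h(x^(t)) = p(h = 1 | x^(t))

    x_tilde = x_t.copy()
    for _ in range(k):                     # Gibbs chain: h ~ p(h|x), then x ~ p(x|h)
        h_sample = (rng.random(b.shape) < sigmoid(b + W @ x_tilde)).astype(float)
        x_tilde = (rng.random(c.shape) < sigmoid(c + W.T @ h_sample)).astype(float)
    h_model = sigmoid(b + W @ x_tilde)     # h(x~)

    W += alpha * (np.outer(h_data, x_t) - np.outer(h_model, x_tilde))
    b += alpha * (h_data - h_model)
    c += alpha * (x_t - x_tilde)
    return x_tilde
```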
46
Topics: contrastive divergence
The more iterations of Gibbs sampling are used to obtain x̃, the better the estimate of the gradient will be.
(TIELEMAN, ICML 2008)
47
Topics: persistent contrastive divergence
Persistent CD: instead of restarting the Gibbs chain at the training example, initialize the chain to the negative sample of the last iteration, i.e. the chain's starting point comes from the previous iteration. Running x^1 → · · · → x^k = x̃ (alternating ∼ p(h|x) and ∼ p(x|h)) then yields the new negative sample x̃.
[Figure: persistent Gibbs chain carried across parameter updates.]
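A sketch of the corresponding persistent-CD update (same caveats as the CD-k sketch above; the persistent chain state x_chain must be stored between calls):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcd_update(x_t, x_chain, W, b, c, k=1, alpha=0.01, rng=None):
    """One persistent-CD update: the Gibbs chain continues from x_chain, carried
    across parameter updates, instead of being restarted at the training example x_t."""
    rng = rng or np.random.default_rng()
    h_data = sigmoid(b + W @ x_t)

    for _ in range(k):
        h_sample = (rng.random(b.shape) < sigmoid(b + W @ x_chain)).astype(float)
        x_chain = (rng.random(c.shape) < sigmoid(c + W.T @ h_sample)).astype(float)
    h_model = sigmoid(b + W @ x_chain)

    W += alpha * (np.outer(h_data, x_t) - np.outer(h_model, x_chain))
    b += alpha * (h_data - h_model)
    c += alpha * (x_t - x_chain)
    return x_chain   # pass this back in as x_chain for the next update
```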
48
[Figure: plot, both axes from 0 to 1.]
(LAROCHELLE ET AL., JMLR 2009)
49
[Figure: plot, both axes from 0 to 1.]
50
(Recap of the RBM energy function and distribution: E(x, h) = −hᵀWx − cᵀx − bᵀh and p(x, h) = exp(−E(x, h))/Z.)
51
Topics: Gaussian-Bernoulli RBM
To model real-valued inputs x, add a quadratic term ½ xᵀx to the energy function; the conditional p(x|h) then becomes a Gaussian with mean c + Wᵀh and identity covariance matrix, while the hidden units remain binary.
(LAROCHELLE ET AL., JMLR 2009)
52
[Figure: plot, both axes from 0 to 1.]
Basic Idea ⇒ Each hidden unit i possesses a binary "spike" variable hi and a real-valued "slab" variable si.
[Figure: visible units v = [v1, ... , vD] connected to spike-and-slab pairs si hi, with s = [s1, ... , sN] and h = [h1, ... , hN].]
53
p(v, s, h) = (1/Z) exp{−E(v, s, h)}
54
E(v, s, h) = −∑_{i=1}^N vᵀWi si hi + ½ vᵀΛv + ½ ∑_{i=1}^N αi si² − ∑_{i=1}^N αi μi si hi + ∑_{i=1}^N αi μi² hi − ∑_{i=1}^N bi hi
with v = [v1, ... , vD], s = [s1, ... , sN], h = [h1, ... , hN].
55
Conditional of the visible variables v given h:
p(v|h) = (1/P(h)) (1/Z) ∫ exp{−E(v, s, h)} ds = N( Cv|h ∑_{i=1}^N Wi μi hi , Cv|h ),
where Cv|h = ( Λ − ∑_{i=1}^N αi⁻¹ hi Wi Wiᵀ )⁻¹.
☺ Models both the mean and the covariance of the conditional p(v|h).
☹ Cv|h is non-diagonal, so p(v|h) ≠ ∏_j p(vj|h): efficient block Gibbs sampling alternating P(v|h) and P(h|v) is not possible.
Conditional dist. of the visibles v given s and h:
p(v|s, h) = (1/p(s, h)) (1/Z) exp{−E(v, s, h)} = N( ( Λ + ∑_{i=1}^N Φi hi )⁻¹ ∑_{i=1}^N Wi si hi , ( Λ + ∑_{i=1}^N Φi hi )⁻¹ )
Diagonal covariance ☺
Conditional dist. of the slabs s given visibles v and spikes h: Sampling from both p(v | s,h) and p(s | v,h) is simple and efficient.
p(s|v, h) = ∏_{i=1}^N p(si|v, h) = ∏_{i=1}^N N( hi ( αi⁻¹ vᵀWi + μi ) , αi⁻¹ )
56
Conditional of the spike variables h given v: P(h|v) = ∏_i P(hi|v), with
P(hi = 1|v) = sigmoid( ½ αi⁻¹ (vᵀWi)² − ½ vᵀΦi v + vᵀWi μi + bi )
The argument mixes terms quadratic in v and linear in v (compare covariance-style units, P(hc_i = 1|v) = sigmoid(−½ (···)² − bc_i), quadratic in v, with mean-style units, P(hm_j = 1|v) = sigmoid((···)_j + bm_j), linear in v).
57
The three conditionals used for sampling:
1. p(v|s, h) = N( ( Λ + ∑_{i=1}^N Φi hi )⁻¹ ∑_{i=1}^N Wi si hi , ( Λ + ∑_{i=1}^N Φi hi )⁻¹ )
2. p(s|v, h) = ∏_{i=1}^N N( hi ( αi⁻¹ vᵀWi + μi ) , αi⁻¹ )
3. P(hi = 1|v) = sigmoid( ½ αi⁻¹ (vᵀWi)² − ½ vᵀΦi v + vᵀWi μi + bi ), independently for each i.
58
Gibbs sampling alternates the three conditionals: h ∼ P(h|v), then s ∼ p(s|v, h), then v ∼ p(v|s, h).
[Figure: spike-and-slab RBM with visibles v1–v5 and spike-and-slab pairs si hi.]
Used the convolutional setup of Krizhevsky (2010)
[Figure: samples from the spike-and-slab RBM.]
61
Topics: extensions to other observations
Yee Whye Teh and Geoffrey Hinton, 2001
Ruslan Salakhutdinov and Geoffrey Hinton, 2009
George Dahl, Ryan Adams, and Hugo Larochelle, 2012