Probabilistic Graphical Models - PowerPoint PPT Presentation



SLIDE 1

Probabilistic Graphical Models

Learning with partial observations

Siamak Ravanbakhsh, Fall 2019

SLIDE 2

Learning objectives

• different types of missing data
• learning with missing data and hidden vars: directed models, undirected models
• develop an intuition for expectation maximization
• variational interpretation

SLIDES 3-5

Two settings for partial observations

• missing data: each instance in D is missing some values
• hidden variables: variables that are never observed

why model hidden variables? (image credit: Murphy's book)
[figure: the hidden variable as an effect, as the original causes, or as a mediating cause]

latent variable models: observations have a common cause
widely used in machine learning

SLIDES 6-9

Missing data

observation mechanism:
• generate the data point X = [X_1, …, X_D]
• decide the values to observe, e.g. O_X = [1, 0, …, 0, 1] (1: observe, 0: hide)

we observe X_o while X_h is missing (X = [X_o; X_h])

missing completely at random (MCAR):
throw a thumbtack to generate, throw again to decide show/hide
P(X, O_X) = P(X) P(O_X)
p(x) = θ^x (1 − θ)^{1−x}
p(o) = ψ^o (1 − ψ)^{1−o}

SLIDES 10-12

Learning with MCAR

missing completely at random (MCAR):
throw to generate, throw to decide show/hide
P(X, O) = P(X) P(O)
p(x) = θ^x (1 − θ)^{1−x}
p(o) = ψ^o (1 − ψ)^{1−o}

objective: learn a model for X from the data D = {x^{(1)}, …, x^{(M)}}
each x may include values for a different subset of vars.

optimize: ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h)

since P(X, O) = P(X) P(O), we can ignore the observation patterns
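Since the joint factorizes as P(X, O) = P(X)P(O) under MCAR, maximizing the likelihood over the observed entries alone recovers θ. A minimal numpy sketch, with made-up values θ = 0.7 and ψ = 0.4:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, psi = 0.7, 0.4            # hypothetical data and show/hide parameters
N = 100_000

x = rng.random(N) < theta        # thumbtack outcomes
o = rng.random(N) < psi          # o = 1: value is shown (independent of x -> MCAR)

# the MLE for theta uses only the observed entries; the mask factorizes out
theta_hat = x[o].mean()
psi_hat = o.mean()
```

Both estimates land close to the true parameters, without ever modeling them jointly.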
SLIDES 13-15

A more general criterion

missing at random (MAR): O_X ⊥ X_h ∣ X_o
• if there is information about the obs. pattern in X_h, then it is also in X_o

example: throw the thumbtack twice, X = [X_1, X_2]
if X_2 = 1 hide X_1, otherwise show X_1
• missing at random, but not missing completely at random

no "extra" information in the obs. pattern → ignore it

optimize: ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h)
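The thumbtack example can be simulated: hiding X_1 depends only on the observed X_2, so averaging the shown X_1 values still estimates θ consistently. A sketch with a hypothetical θ = 0.3:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3                       # hypothetical thumbtack parameter
N = 200_000

x1 = rng.random(N) < theta
x2 = rng.random(N) < theta
shown = ~x2                       # hide X_1 whenever X_2 = 1 (depends only on observed X_2 -> MAR)

# ignoring the observation pattern still gives a consistent estimate of theta
theta_hat = x1[shown].mean()
```

Because X_1 ⊥ X_2, the shown subset is an unbiased sample of X_1.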

SLIDES 16-18

Likelihood function

fully observed data:
• directed: likelihood decomposes
• undirected: does not decompose, but it is concave

partially observed:
• does not decompose
• no longer concave

for partial observations, optimize the marginal likelihood:
ℓ(D, θ) = Σ_{x_o ∈ D} log Σ_{x_h} p(x_o, x_h)
the inner term is the likelihood for a single assignment to the latent vars.

SLIDES 19-20

Likelihood function: example

for a directed model x → y, x → z

fully observed case decomposes:
ℓ(D, θ) = Σ_{(x,y,z) ∈ D} log p(x, y, z)
        = Σ_x log p(x) + Σ_{x,y} log p(y∣x) + Σ_{x,z} log p(z∣x)

x is always missing (e.g., in a latent variable model):
ℓ(D, θ) = Σ_{(y,z) ∈ D} log Σ_x p(x) p(y∣x) p(z∣x)
cannot decompose it!
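A quick numeric check of the fully observed decomposition for the x → y, x → z network (all CPT numbers are made up); with x hidden, the objective instead puts the log outside a sum over x:

```python
import numpy as np

# hypothetical CPTs for the binary network x -> y, x -> z
px = np.array([0.6, 0.4])                   # p(x)
py_x = np.array([[0.8, 0.2], [0.3, 0.7]])   # p(y|x), row = x
pz_x = np.array([[0.5, 0.5], [0.1, 0.9]])   # p(z|x), row = x

data = [(0, 1, 0), (1, 0, 1), (0, 0, 0)]    # fully observed (x, y, z) triples

# fully observed: the log-likelihood splits into one sum per CPT
joint_ll = sum(np.log(px[x] * py_x[x, y] * pz_x[x, z]) for x, y, z in data)
decomposed = (sum(np.log(px[x]) for x, y, z in data)
              + sum(np.log(py_x[x, y]) for x, y, z in data)
              + sum(np.log(pz_x[x, z]) for x, y, z in data))

# x hidden: log of a sum over x -> no per-CPT decomposition
marginal_ll = sum(np.log(sum(px[x] * py_x[x, y] * pz_x[x, z] for x in (0, 1)))
                  for _, y, z in data)
```

The two fully observed expressions agree exactly; the marginal likelihood is the quantity EM and gradient methods must attack.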

SLIDES 21-23

Parameter learning with missing data

directed models:
• option 1: obtain the gradient of the marginal likelihood
• option 2: expectation maximization (EM), with a variational interpretation

undirected models:
• obtain the gradient of the marginal likelihood
• EM is not a good option here

all of these options need inference for each step of learning

SLIDES 24-26

Gradient of the marginal likelihood (directed models): example

network over a, b, c, d, with b and c hidden

log marginal likelihood:
ℓ(D) = Σ_{(a,d) ∈ D} log Σ_{b,c} p(a) p(b) p(c∣a,b) p(d∣c)

take the derivative:
∂ℓ(D)/∂p(d′∣c′) = Σ_{(a,d) ∈ D} (1/p(d′∣c′)) p(d′, c′∣a, d)

need inference for p(d′, c′∣a, d)

what happens to this expression if every variable is observed?

SLIDES 27-29

Gradient of the marginal likelihood (directed models)

for a Bayesian Network with CPTs:
∂ℓ(D)/∂p(x_i∣pa_{x_i}) = Σ_{x_o ∈ D} (1/p(x_i∣pa_{x_i})) p(x_i, pa_{x_i}∣x_o)

here x_i, pa_{x_i} is some specific assignment
run inference for each observation

a technical issue: this gradient is always non-negative
• the constraint Σ_x p(x∣pa_x) = 1 is not enforced
• reparametrize (e.g., using softmax), or use Lagrange multipliers

for other parametrizations (beyond simple CPTs) use the chain rule:
∂ℓ(D; θ)/∂θ = Σ_{(c′,d′)} (∂ℓ(D)/∂p(d′∣c′)) (∂p(d′∣c′)/∂θ)
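The derivative formula can be sanity-checked numerically on an even smaller network x → y with x hidden, treating each CPT entry as a free parameter (as the slides do before the simplex constraint is addressed); all numbers are hypothetical:

```python
import numpy as np

px = np.array([0.6, 0.4])                   # p(x), x is hidden
py_x = np.array([[0.7, 0.3], [0.2, 0.8]])   # p(y|x), row = x
data = [0, 1, 1, 0, 1]                      # observed values of y

def loglik(py_x):
    return sum(np.log((px * py_x[:, y]).sum()) for y in data)

# analytic gradient: sum over data of p(x', y' | y_obs) / p(y'|x'),
# which is nonzero only in the column y' = y_obs
grad = np.zeros_like(py_x)
for y in data:
    post = px * py_x[:, y]
    post /= post.sum()                      # p(x | y_obs), by inference
    grad[:, y] += post / py_x[:, y]

# central finite differences over each CPT entry
eps, fd = 1e-5, np.zeros_like(py_x)
for i in range(2):
    for j in range(2):
        hi, lo = py_x.copy(), py_x.copy()
        hi[i, j] += eps; lo[i, j] -= eps
        fd[i, j] = (loglik(hi) - loglik(lo)) / (2 * eps)
```

The finite-difference and analytic gradients agree to numerical precision, and the analytic gradient requires exactly one posterior inference per data point.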

SLIDES 30-34

Expectation Maximization (directed models): example

E-step: for each (a, d) ∈ D use the current parameters θ to get the marginals
p_{θ,D}(A), p_{θ,D}(B), p_{θ,D}(C), p_{θ,D}(A, B, C), p_{θ,D}(D, C)
more generally: expected sufficient statistics

p_{θ,D}(C = c′, D = d′) = (1/N) Σ_{(a,d) ∈ D} p_θ(c′, d′∣a, d)
nonzero for d′ = d
in general we need inference to estimate these sufficient statistics

M-step: use the marginals (as with completely observed data) to learn θ
e.g., update θ_{D∣C} using θ^{new}_{D∣C} = p_{θ,D}(C, D) / p_{θ,D}(C)

SLIDES 35-37

Expectation Maximization (directed models)

for a Bayesian Network with CPTs:

E-step: for each x_o ∈ D use the current parameters θ to get the marginals
{p_{θ,D}(X_i), p_{θ,D}(X_i, Pa_{X_i})}

M-step: use the marginals (as with completely observed data) to learn θ:
θ^{new}_{X_i∣Pa_{X_i}} = p_{θ,D}(X_i, Pa_{X_i}) / p_{θ,D}(Pa_{X_i})

for undirected models: the M-step is the expensive part
performing an E-step within each iteration of the M-step is equivalent to gradient descent
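A compact EM loop for the earlier x → y, x → z network (x never observed) shows the alternation of E- and M-steps and the monotone improvement of the marginal likelihood; ground-truth CPTs and the asymmetric initialization are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
# sample (x, y, z) from a hypothetical ground-truth model, then discard x
tx = rng.choice(2, size=N, p=[0.7, 0.3])
y = (rng.random(N) < np.array([0.1, 0.8])[tx]).astype(int)   # p(y=1|x)
z = (rng.random(N) < np.array([0.4, 0.9])[tx]).astype(int)   # p(z=1|x)

# asymmetric initialization (a fully symmetric one is a fixed point of EM)
px = np.array([0.6, 0.4])
py = np.array([[0.7, 0.3], [0.4, 0.6]])    # p(y|x), row = x
pz = np.array([[0.55, 0.45], [0.3, 0.7]])  # p(z|x), row = x

lls = []
for _ in range(25):
    joint = px[:, None, None] * py[:, :, None] * pz[:, None, :]  # p(x, y, z)
    lls.append(np.log(joint.sum(0)[y, z]).sum())                 # marginal log-lik
    R = joint[:, y, z]
    R = R / R.sum(0)                        # E-step: posterior p(x | y_i, z_i)
    px = R.mean(1)                          # M-step: expected counts -> new CPTs
    for v in (0, 1):
        py[:, v] = R[:, y == v].sum(1)
        pz[:, v] = R[:, z == v].sum(1)
    py /= py.sum(1, keepdims=True)
    pz /= pz.sum(1, keepdims=True)
```

Each iteration runs exact inference (the E-step) and then normalizes expected counts (the M-step), and the recorded log-likelihoods never decrease.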

SLIDES 38-40

Expectation Maximization: example (alarm network)

1000 training instances, 50% of variables observed (in each instance)
fast initial improvement
[plots: change in different parameter values; train log-likelihood; test log-likelihood]

local optima in EM (alarm network, a single hidden variable):
[plots: number of local maxima; effect of multiple restarts]

SLIDES 41-43

Expected log-likelihood (directed models)

original objective: ℓ(D, θ) = Σ_{x_o ∈ D} log p_θ(x_o) = Σ_{x_o ∈ D} log Σ_{x_h} p_θ(x_o, x_h)

EM iteration: maximizes the expected log-likelihood
Σ_{x_o ∈ D} E_{p_θ(x_h∣x_o)} [log p_θ(x_o, x_h)]

E-step: soft-complete the data
M-step: maximize the full likelihood

how are these objectives related? any guarantees for EM?
the variational interpretation relates the two

SLIDES 44-56

Variational interpretation of EM

Recall: variational inference
min_q D_KL(q(x)∥p(x)) = min_q −H(q) − E_q[log p(x)]

with p(x) = p̃(x)/Z:
= min_q −H(q) − E_q[log p̃(x)] + log Z
(the variational free energy)

for a latent variable model, p(x_h∣x_o) = p(x_o, x_h)/p(x_o), so
D_KL(q(x_h)∥p(x_h∣x_o)) = −H(q) − E_q[log p(x_o, x_h)] + log p(x_o)

re-arrange:
log p_θ(x_o) = D_KL(q(x_h)∥p_θ(x_h∣x_o)) + H(q) + E_q[log p_θ(x_o, x_h)]
• the left-hand side is the original objective
• E_q[log p_θ(x_o, x_h)] is the expected log-likelihood wrt q
• H(q) is ignored by EM (it is constant wrt θ in the M-step)

Coordinate ascent:
• E-step: optimize q for a fixed θ (variational inference)
• M-step: optimize θ for a fixed q
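The rearranged identity holds for any q, which is easy to verify on a two-state hidden variable with a made-up joint table:

```python
import numpy as np

p_joint = np.array([[0.15, 0.25],    # hypothetical p(x_h, x_o); rows x_h, cols x_o
                    [0.35, 0.25]])
xo = 0                               # observed value
p_xo = p_joint[:, xo].sum()          # p(x_o)
post = p_joint[:, xo] / p_xo         # p(x_h | x_o)

q = np.array([0.9, 0.1])             # an arbitrary distribution over x_h

kl = (q * np.log(q / post)).sum()               # D_KL(q || posterior)
H = -(q * np.log(q)).sum()                      # entropy of q
expected_ll = (q * np.log(p_joint[:, xo])).sum()  # E_q[log p(x_o, x_h)]

lhs = np.log(p_xo)
rhs = kl + H + expected_ll
```

`lhs` and `rhs` agree to machine precision for any choice of q, which is exactly what licenses the coordinate-ascent view of EM.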

SLIDES 57-59

EM as coordinate ascent

Coordinate ascent:
• E-step: optimize q for a fixed θ
• M-step: optimize θ for a fixed q

guaranteed improvement of log p_θ(x_o)
converges to a local optimum

SLIDES 60-66

Amortized inference in latent variable models

log p_θ(x_o) = D_KL(q(x_h)∥p_θ(x_h∣x_o)) + H(q) + E_q[log p_θ(x_o, x_h)]

the evidence lower bound (ELBO), H(q) + E_q[log p_θ(x_o, x_h)], is a lower bound on the log-likelihood

amortization: make q a function of the observations
use q_ψ(x_h∣x_o) instead of q(x_h)

writing p_θ(x_o, x_h) = p_θ(x_h) p_θ(x_o∣x_h), the ELBO becomes
−D_KL(q_ψ(x_h∣x_o)∥p_θ(x_h)) + E_{q_ψ}[log p_θ(x_o∣x_h)]

maximize the ELBO by jointly optimizing ψ, θ
use neural networks to represent the conditional distributions, and back-propagation for optimization
→ Variational Auto-Encoder (VAE)
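The two ways of writing the ELBO (entropy plus expected joint, versus negative KL to the prior plus expected conditional likelihood) agree for any q, and the bound is tight exactly when q is the posterior; a check on a made-up two-state joint:

```python
import numpy as np

p_joint = np.array([[0.10, 0.30],    # hypothetical p(x_h, x_o); rows x_h, cols x_o
                    [0.40, 0.20]])
xo = 1
prior = p_joint.sum(1)               # p(x_h)
lik = p_joint[:, xo] / prior         # p(x_o | x_h)
p_xo = p_joint[:, xo].sum()
post = p_joint[:, xo] / p_xo         # p(x_h | x_o)

def elbo_entropy_form(q):            # H(q) + E_q[log p(x_o, x_h)]
    return -(q * np.log(q)).sum() + (q * np.log(p_joint[:, xo])).sum()

def elbo_kl_form(q):                 # -KL(q || prior) + E_q[log p(x_o | x_h)]
    return -(q * np.log(q / prior)).sum() + (q * np.log(lik)).sum()

q = np.array([0.2, 0.8])             # an arbitrary variational distribution
gap = np.log(p_xo) - elbo_entropy_form(q)   # = KL(q || posterior) >= 0
```

The gap to log p(x_o) is exactly the KL term, so it is positive for a mismatched q and vanishes at the posterior.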

SLIDES 67-70

Undirected models with latent variables

recall: linear exponential family p(x; θ) = (1/Z(θ)) exp(⟨θ, ϕ(x)⟩)

gradient in the fully observed setting:
∇_θ ℓ(θ, D) = ∣D∣ (E_D[ϕ(x)] − E_{p_θ}[ϕ(x)])
expectation wrt the data vs. expectation wrt the model

partial observation: x = (x_o, x_h), with x_h not observed
marginal likelihood: p(x_o; θ) = Σ_{x_h} (1/Z(θ)) exp(⟨θ, ϕ(x)⟩)

gradient in the partially observed case:
∇_θ ℓ(θ, D) = ∣D∣ (E_{D,θ}[ϕ(x)] − E_{p_θ}[ϕ(x)])
now, wrt both data and model, we need to do inference to calculate the expected sufficient statistics (similar to the E-step in EM)
SLIDES 71-75

Example: Restricted Boltzmann Machine (RBM)

binary RBM: p(h, v) = (1/Z(θ)) exp(Σ_{i,j} θ_{i,j} v_i h_j) for v_i, h_j ∈ {0, 1}
data: D = {v^{(m)}}_m
sufficient statistics: ϕ(v_i, h_j) = v_i h_j

we want to optimize:
ℓ(D; θ) = Σ_{v ∈ D} log Σ_h (1/Z(θ)) exp(Σ_{i,j} θ_{i,j} v_i h_j)

gradient:
∂ℓ(D; θ)/∂θ_{i,j} ∝ E_{D,θ}[v_i h_j] − E_{p_θ}[v_i h_j]
= (1/M) Σ_{v′ ∈ D} v′_i E_{p_θ}[h_j∣v′] − E_{p_θ}[v_i h_j]

sampling-based inference: sample h∣v
use Gibbs sampling: sample both h, v using the current parameters
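For an RBM small enough to enumerate, the closed-form conditional p(h_j = 1 ∣ v) = σ((θᵀv)_j) used in the data term of the gradient can be checked against brute force (the weights are random and purely illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nv, nh = 3, 2
theta = rng.normal(size=(nv, nh)) * 0.5       # illustrative random weights

def unnorm(v, h):                             # exp(sum_ij theta_ij v_i h_j)
    return np.exp(np.asarray(v) @ theta @ np.asarray(h))

v0 = np.array([1, 0, 1])
hs = [np.array(h) for h in product([0, 1], repeat=nh)]

# brute-force p(h_2 = 1 | v0) vs. the sigmoid closed form
brute = sum(unnorm(v0, h) for h in hs if h[1] == 1) / sum(unnorm(v0, h) for h in hs)
closed = 1.0 / (1.0 + np.exp(-(v0 @ theta)))  # p(h_j = 1 | v0) for every j

# exact model expectation E_p[v_i h_j] by enumerating all (v, h) states
vs = [np.array(v) for v in product([0, 1], repeat=nv)]
Z = sum(unnorm(v, h) for v in vs for h in hs)
E_model = sum(unnorm(v, h) / Z * np.outer(v, h) for v in vs for h in hs)
```

In a real RBM, enumeration is infeasible and `E_model` is the term replaced by Gibbs sampling; the conditional needed for the data term stays exact and cheap.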

SLIDE 76

Summary

learning with partial observations:
• missing data: optimize the likelihood when missing at random
• latent variables: can produce expressive probabilistic models

the optimization problem is non-convex; how to learn the model?
• directly estimate the gradient (directed and undirected)
• use EM (directed models)
• variational interpretation + relation to the ELBO

SLIDES 77-80

Example: Gaussian mixture model

model parameters θ = [π, {μ_k, Σ_k}]:
p(x; π) = Π_k π_k^{I(x=k)}
p(y∣x; {μ_k, Σ_k}) = (1/√∣2πΣ_x∣) exp(−(1/2) (y − μ_x)^T Σ_x^{−1} (y − μ_x))

E-step: calculate p(x∣y) for each y ∈ D
p(x = k∣y) ∝ p(x = k; π) p(y∣x = k; μ, Σ) = π_k N(y; μ_k, Σ_k)
now we have "probabilistically completed" instances; update the parameters (easy in a Bayes-net)

M-step: estimate π, μ_k, Σ_k ∀k
π_k^{new} = (1/N) Σ_{y ∈ D} p(x = k∣y)
μ_k = Σ_{y ∈ D} p(x = k∣y) y / Σ_{y ∈ D} p(x = k∣y)   (mean of a weighted set of instances)
Σ_k = Σ_{y ∈ D} p(x = k∣y) (y − μ_k)(y − μ_k)^T / Σ_{y ∈ D} p(x = k∣y)   (covariance of a weighted set of instances)
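The E- and M-steps above can be sketched for K = 2 one-dimensional components (synthetic data and a hypothetical initialization); the marginal log-likelihood is non-decreasing across iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated 1-D clusters; then fit K = 2 components by EM
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
N, K = len(y), 2
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])                 # asymmetric initialization
var = np.array([1.0, 1.0])

def logpdf(y, mu, var):                    # 1-D Gaussian log-density
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mu) ** 2 / var

lls = []
for _ in range(30):
    # E-step: responsibilities p(x = k | y) via log-sum-exp
    logw = np.log(pi) + logpdf(y[:, None], mu, var)   # (N, K)
    ll = np.logaddexp.reduce(logw, axis=1)
    lls.append(ll.sum())
    r = np.exp(logw - ll[:, None])
    # M-step: weighted mixing proportions, means, variances
    Nk = r.sum(0)
    pi = Nk / N
    mu = (r * y[:, None]).sum(0) / Nk
    var = (r * (y[:, None] - mu) ** 2).sum(0) / Nk
```

The fitted means land near the cluster centers, and `lls` is monotone, as the coordinate-ascent view guarantees.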