
slide-1
SLIDE 1

Probabilistic Graphical Models

parameter learning in undirected models

Siamak Ravanbakhsh Fall 2019

slide-2
SLIDE 2

Learning objectives

- the form of the likelihood for undirected models
- why is it difficult to optimize?
- conditional likelihood in undirected models
- different approximations for parameter learning:
  - MAP inference and regularization
  - pseudo-likelihood
  - pseudo-moment matching
  - contrastive learning

slide-6
SLIDE 6

Likelihood in MRFs

example: variables A, B, C with one feature on the pair (A, B) and one on the pair (B, C): $I(A=1, B=1)$ and $I(B=1, C=1)$

probability dist.

$$p(A, B, C; \theta) = \frac{1}{Z(\theta)} \exp\big(\theta_1 I(A=1, B=1) + \theta_2 I(B=1, C=1)\big)$$

observations: $|D| = 100$, $\mathbb{E}_D[I(A=1, B=1)] = .4$, $\mathbb{E}_D[I(B=1, C=1)] = .4$

log-likelihood:

$$\log p(D; \theta) = \sum_{(a,b,c) \in D} \big(\theta_1 I(a=1, b=1) + \theta_2 I(b=1, c=1)\big) - 100 \log Z(\theta) = 40\theta_1 + 40\theta_2 - 100 \log Z(\theta)$$

[figure: the log-likelihood function plotted over $\theta_1$ and $\theta_2$]

because of the partition function, the likelihood does not decompose
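
The example above is small enough to evaluate by brute force. The following is a minimal sketch, assuming binary A, B, C and the two indicator features from the slide; the function names (`log_partition`, `log_likelihood`, `coupling`) are illustrative and not from the lecture.

```python
import itertools
import math


def log_partition(theta1, theta2):
    """log Z(theta), summing the unnormalized probability over all 8 states."""
    total = 0.0
    for a, b, c in itertools.product([0, 1], repeat=3):
        score = theta1 * (a == 1 and b == 1) + theta2 * (b == 1 and c == 1)
        total += math.exp(score)
    return math.log(total)


def log_likelihood(theta1, theta2, n=100, mu1=0.4, mu2=0.4):
    """n * (mu1*theta1 + mu2*theta2 - log Z): 40*theta1 + 40*theta2 - 100*log Z."""
    return n * (mu1 * theta1 + mu2 * theta2 - log_partition(theta1, theta2))


def coupling(t1, t2):
    """Mixed difference of log Z; it would be 0 if log Z split as f(theta1) + g(theta2)."""
    return (log_partition(t1, t2) - log_partition(t1, 0.0)
            - log_partition(0.0, t2) + log_partition(0.0, 0.0))


print(log_likelihood(0.0, 0.0))
print(coupling(1.0, 1.0))
```

The linear part, 40θ₁ + 40θ₂, splits across parameters; a nonzero value of `coupling` is one way to see that log Z(θ) does not, which is why the likelihood does not decompose.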

slide-10
SLIDE 10

Likelihood in linear exponential family (log-linear models)

probability distribution, with sufficient statistics $\phi(x)$:

$$p(x; \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, \phi(x) \rangle)$$

log-likelihood of D:

$$\ell(D, \theta) = \log p(D; \theta) = \sum_{x \in D} \langle \theta, \phi(x) \rangle - |D| \log Z(\theta)$$

$$\ell(D, \theta) = |D| \big( \langle \theta, \mathbb{E}_D[\phi(x)] \rangle - \log Z(\theta) \big)$$

where $\mu_D = \mathbb{E}_D[\phi(x)]$ are the expected sufficient statistics

example (image: Michael Jordan's draft)

params.: $\theta_{1,2,0,0}$, $\theta_{1,2,1,0}$, $\theta_{1,2,0,1}$, $\theta_{1,2,1,1}$

expected sufficient statistics:

$$\mathbb{E}_D[I(X_1=0, X_2=0)] = P(X_1=0, X_2=0)$$
$$\mathbb{E}_D[I(X_1=1, X_2=0)] = P(X_1=1, X_2=0)$$
$$\mathbb{E}_D[I(X_1=0, X_2=1)] = P(X_1=0, X_2=1)$$
$$\mathbb{E}_D[I(X_1=1, X_2=1)] = P(X_1=1, X_2=1)$$
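
As a small illustration of the example above, the expected sufficient statistics of indicator features are just empirical frequencies. This is a sketch under the assumption that $X_1, X_2$ are binary; the dataset and the name `empirical_pair_statistics` are hypothetical.

```python
from collections import Counter


def empirical_pair_statistics(samples):
    """Return E_D[I(X1 = a, X2 = b)] for every (a, b) in {0, 1}^2."""
    counts = Counter(samples)
    n = len(samples)
    return {(a, b): counts[(a, b)] / n for a in (0, 1) for b in (0, 1)}


# hypothetical dataset of (x1, x2) observations
data = [(0, 0), (0, 1), (1, 1), (1, 1)]
mu_D = empirical_pair_statistics(data)
print(mu_D)
```

Each entry of `mu_D` is the empirical probability of one joint assignment, so the four entries sum to one.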

slide-12
SLIDE 12

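Likelihood in linear exponential family (log-linear models)

probability distribution, with sufficient statistics $\phi(x)$:

$$p(x; \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, \phi(x) \rangle)$$

log-likelihood of D, with expected sufficient statistics $\mu_D = \mathbb{E}_D[\phi(x)]$:

$$\ell(D, \theta) = \log p(D; \theta) = \sum_{x \in D} \langle \theta, \phi(x) \rangle - |D| \log Z(\theta) = |D| \big( \langle \theta, \mathbb{E}_D[\phi(x)] \rangle - \log Z(\theta) \big)$$

$\log Z(\theta)$ has interesting properties

$$\frac{\partial}{\partial \theta_i} \log Z(\theta) = \frac{\frac{\partial}{\partial \theta_i} \sum_x \exp(\langle \theta, \phi(x) \rangle)}{Z(\theta)} = \frac{1}{Z(\theta)} \sum_x \phi_i(x) \exp(\langle \theta, \phi(x) \rangle) = \mathbb{E}_{p_\theta}[\phi_i(x)]$$

so

$$\nabla_\theta \log Z(\theta) = \mathbb{E}_{p_\theta}[\phi(x)]$$

$$\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log Z(\theta) = \mathbb{E}[\phi_i(x) \phi_j(x)] - \mathbb{E}[\phi_i(x)] \mathbb{E}[\phi_j(x)] = \mathrm{Cov}(\phi_i, \phi_j)$$

so the Hessian matrix is a covariance matrix, hence positive semi-definite, and $\log Z(\theta)$ is convex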
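
The gradient identity can be checked numerically on a small model. The sketch below assumes the three-variable example with features $I(A=1, B=1)$ and $I(B=1, C=1)$ and compares a central finite difference of log Z with the model expectation; all function names are illustrative.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=3))


def phi(x):
    """Sufficient statistics of the toy model: two indicator features."""
    a, b, c = x
    return (float(a == 1 and b == 1), float(b == 1 and c == 1))


def log_Z(theta):
    """Log partition function by enumeration over the 8 states."""
    return math.log(sum(math.exp(sum(t * f for t, f in zip(theta, phi(x))))
                        for x in STATES))


def model_expectation(theta):
    """E_{p_theta}[phi(x)] by enumeration."""
    lz = log_Z(theta)
    mean = [0.0, 0.0]
    for x in STATES:
        f = phi(x)
        p = math.exp(sum(t * fi for t, fi in zip(theta, f)) - lz)
        for k in range(2):
            mean[k] += p * f[k]
    return mean


def finite_difference_grad(theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of log Z."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((log_Z(up) - log_Z(down)) / (2 * eps))
    return grad


theta = [0.3, -0.7]
print(finite_difference_grad(theta))
print(model_expectation(theta))
```

The two printed vectors should agree up to finite-difference error. Enumeration is only feasible here because the model has 8 states.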

slide-18
SLIDE 18

Likelihood in linear exponential family (log-linear models)

probability distribution:

$$p(x; \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, \phi(x) \rangle)$$

log-likelihood of D:

$$\ell(D, \theta) = |D| \big( \underbrace{\langle \theta, \mathbb{E}_D[\phi(x)] \rangle}_{\text{linear in } \theta} - \underbrace{\log Z(\theta)}_{\text{convex}} \big)$$

linear minus convex, so $\ell(D, \theta)$ is concave

should be easy to maximize (?) NO!

- estimating $Z(\theta)$ is a difficult inference problem
- how about just using the gradient info? $\nabla_\theta \log Z(\theta) = \mathbb{E}_{p_\theta}[\phi(x)]$ involves inference as well

learning undirected models: any combination of an inference method with gradient-based optimization

slide-20
SLIDE 20

Moment matching for linear exponential family

probability distribution:

$$p(x; \theta) = \frac{1}{Z(\theta)} \exp(\langle \theta, \phi(x) \rangle)$$

log-likelihood of D (linear in $\theta$ minus convex, so concave):

$$\ell(D, \theta) = |D| \big( \langle \theta, \mathbb{E}_D[\phi(x)] \rangle - \log Z(\theta) \big)$$

set its derivative to zero:

$$\nabla_\theta \ell(D, \theta) = |D| \big( \mathbb{E}_D[\phi(x)] - \mathbb{E}_{p_\theta}[\phi(x)] \big) = 0 \;\Rightarrow\; \mathbb{E}_{p_\theta}[\phi(x)] = \mathbb{E}_D[\phi(x)]$$

find the parameter $\theta$ that results in the same expected sufficient statistics as the data

e.g., $p(X_1=0, X_2=1; \theta) = p_D(X_1=0, X_2=1)$
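
The moment-matching condition can be reached by plain gradient ascent when the model is tiny. The sketch below is an illustration on the three-variable example (features $I(A=1, B=1)$ and $I(B=1, C=1)$, both with empirical mean 0.4), with inference done by enumeration; the step size and iteration count are arbitrary choices, not values from the lecture.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=3))


def phi(x):
    """Two indicator features of the toy model."""
    a, b, c = x
    return (float(a == 1 and b == 1), float(b == 1 and c == 1))


def model_expectation(theta):
    """E_{p_theta}[phi(x)] by enumeration (the 'inference' step)."""
    weights = [math.exp(sum(t * f for t, f in zip(theta, phi(x)))) for x in STATES]
    z = sum(weights)
    return [sum(w * phi(x)[k] for w, x in zip(weights, STATES)) / z
            for k in range(2)]


def fit_by_moment_matching(mu_data, step=1.0, iters=2000):
    """Gradient ascent on the average log-likelihood: grad = mu_data - mu_model."""
    theta = [0.0, 0.0]
    for _ in range(iters):
        mu_model = model_expectation(theta)
        theta = [t + step * (d - m) for t, d, m in zip(theta, mu_data, mu_model)]
    return theta


theta_hat = fit_by_moment_matching([0.4, 0.4])
print(theta_hat, model_expectation(theta_hat))
```

For this example the condition can also be solved by hand: by symmetry $\theta_1 = \theta_2$, and both equal $\log(5/3)$, which the ascent should approach. Every iteration calls `model_expectation`, which is the inference step that becomes expensive in larger models.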

slide-23
SLIDE 23

Learning needs inference in an inner loop

maximizing the likelihood: $\arg\max_\theta \log p(D \mid \theta)$

gradient: $\propto \mathbb{E}_D[\phi(x)] - \mathbb{E}_{p_\theta}[\phi(x)]$

- $\mathbb{E}_D[\phi(x)]$: easy to calculate
- $\mathbb{E}_{p_\theta}[\phi(x)]$: inference in the graphical model

optimality condition: $\mathbb{E}_D[\phi(x)] = \mathbb{E}_{p_\theta}[\phi(x)]$

example: in discrete pairwise MRF

$$p_D(x_i, x_j) = p(x_i, x_j; \theta) \quad \forall (i, j) \in \mathcal{E}$$

(left: empirical marginals; right: marginals in our current model)

what if exact inference is infeasible?

- learning with approx. inference: often amounts to exact optimization of an approx. objective
- use sampling, variational inference ...

slide-24
SLIDE 24

Conditional training

generative vs. discriminative training

Recall: Hidden Markov Model (HMM), trained generatively

$$\ell(D, \theta) = \sum_{(x,y) \in D} \log p(x, y)$$

- easy to train the Bayes-net (assuming full observation)
- the likelihood decomposes

Conditional random fields (CRF), trained discriminatively

- maximizing conditional log-likelihood

$$\ell_{Y|X}(D, \theta) = \sum_{(x,y) \in D} \log p(y \mid x)$$

- how to maximize this?

slide-28
SLIDE 28

Conditional training

objective:

$$\arg\max_\theta \ell_{Y|X}(D, \theta) = \arg\max_\theta \sum_{(x,y) \in D} \log p(y \mid x)$$

again consider the gradient:

$$\nabla_\theta \ell_{Y|X}(D, \theta) = \sum_{(x', y') \in D} \big( \phi(x', y') - \mathbb{E}_{p(\cdot \mid x'; \theta)}[\phi(x', y)] \big)$$

the second term is the conditional expectation of the sufficient statistics; it is conditioned on the observed x'

to obtain the gradient: for each instance $(x, y) \in D$, run inference conditioned on x

inference on the reduced MRF is easy in this case

compared to generative training in undirected models:

- pro: conditioning could simplify inference
- con: have to run inference for each datapoint
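
The per-datapoint inference in the conditional gradient can be sketched on a very small conditional model. The example assumes a binary label y and a hypothetical feature map $\phi(x, y) = y \cdot x$ (not from the lecture), so that the expectation under $p(\cdot \mid x; \theta)$ is a sum over two labels.

```python
import math

LABELS = (0, 1)


def phi(x, y):
    """Hypothetical feature map: copy the input features when y = 1."""
    return [xi * y for xi in x]


def conditional_probs(theta, x):
    """p(y | x; theta) for each label: inference conditioned on x."""
    scores = [sum(t * f for t, f in zip(theta, phi(x, y))) for y in LABELS]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def conditional_log_likelihood(theta, data):
    """Sum of log p(y | x; theta) over the dataset."""
    return sum(math.log(conditional_probs(theta, x)[y]) for x, y in data)


def conditional_gradient(theta, data):
    """Sum over data of phi(x', y') - E_{p(.|x'; theta)}[phi(x', y)]."""
    grad = [0.0] * len(theta)
    for x, y in data:
        probs = conditional_probs(theta, x)  # one inference call per datapoint
        observed = phi(x, y)
        for k in range(len(theta)):
            expected_k = sum(p * phi(x, y_alt)[k] for y_alt, p in zip(LABELS, probs))
            grad[k] += observed[k] - expected_k
    return grad


# hypothetical dataset of (x, y) pairs
data = [((1.0, 2.0), 1), ((1.0, -1.0), 0), ((1.0, 0.5), 1)]
print(conditional_gradient([0.2, -0.3], data))
```

The loop makes the cost visible: `conditional_probs` runs once for every datapoint, whereas generative training needs a single expectation under the current model per gradient step.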

slide-30
SLIDE 30

Local priors & regularization

max-likelihood can lead to over-fitting

Bayesian approach: prior $p(\theta)$, posterior $p(\theta \mid D)$

- in Bayes-nets: decomposed prior ⇒ decomposed posterior
- in Markov nets: the posterior does not decompose (because the likelihood does not decompose, due to the partition function)

alternative to a full-Bayesian approach: MAP inference, maximize the log-posterior

$$\arg\max_\theta \; \log p(D \mid \theta) + \log p(\theta)$$

- $\log p(\theta)$ serves as a regularization; the prior does not have to be conjugate
- does not model uncertainty
- sensitive to parametrization

slide-34
SLIDE 34

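Gaussian & Laplace priors

MAP inference: find the maximum of the posterior

$$\arg\max_\theta \; \log p(D \mid \theta) + \log p(\theta)$$

choices for $p(\theta)$:

- the product of univariate Gaussian (L2 reg.)
- the product of univariate Laplacian (L1 reg.)

Gaussian prior:

$$p(\theta; \sigma) \propto \prod_i \exp\left(-\frac{\theta_i^2}{2\sigma^2}\right) \;\Rightarrow\; \log p(\theta; \sigma) = -\frac{1}{2\sigma^2} \sum_i \theta_i^2 + c$$

the first term is the L2 regularization penalty term

Laplace prior:

$$p(\theta; \beta) = \prod_i \frac{1}{2\beta} \exp\left(-\frac{|\theta_i|}{\beta}\right) \;\Rightarrow\; \log p(\theta; \beta) = -\frac{1}{\beta} \sum_i |\theta_i| + c$$

the first term is the L1 regularization penalty term (sparsity-inducing)

both of these penalize large parameter values

both reduce fluctuations in the density:

$$\log \frac{p(x; \theta)}{p(x'; \theta)} = \theta^T (\phi(x) - \phi(x'))$$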
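
The correspondence between the two priors and the two penalties can be verified directly: up to a constant that does not depend on θ, the log-prior is the negative penalty. The sketch below uses the densities from this slide; the function names are illustrative.

```python
import math


def gaussian_log_prior(theta, sigma):
    """Exact log density of a product of univariate N(0, sigma^2) priors."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2) - t ** 2 / (2 * sigma ** 2)
               for t in theta)


def laplace_log_prior(theta, beta):
    """Exact log density of a product of univariate Laplace(0, beta) priors."""
    return sum(-math.log(2 * beta) - abs(t) / beta for t in theta)


def l2_penalty(theta, sigma):
    """(1 / (2 sigma^2)) * sum_i theta_i^2."""
    return sum(t ** 2 for t in theta) / (2 * sigma ** 2)


def l1_penalty(theta, beta):
    """(1 / beta) * sum_i |theta_i|."""
    return sum(abs(t) for t in theta) / beta


theta = [0.5, -1.5, 2.0]
zero = [0.0, 0.0, 0.0]
# The constant c cancels in differences of log-priors.
print(gaussian_log_prior(theta, 2.0) - gaussian_log_prior(zero, 2.0), -l2_penalty(theta, 2.0))
print(laplace_log_prior(theta, 0.5) - laplace_log_prior(zero, 0.5), -l1_penalty(theta, 0.5))
```

In a MAP objective the constant c can be dropped, so adding $\log p(\theta)$ to the log-likelihood amounts to subtracting the corresponding penalty.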

slide-37
SLIDE 37

Pseudo-moment matching

we want to set the parameters $\theta$ such that if/when loopy BP converges:

$$p_D(A, B) = \hat{p}(A, B; \theta), \quad p_D(B, D) = \hat{p}(B, D; \theta), \; \ldots$$

(left: empirical marginals; right: marginals using BP)

[figure: pairwise MRF with factors $\phi(A, B)$, $\phi(B, D)$, $\phi(D, F)$, $\phi(F, E)$, $\phi(C, E)$, $\phi(A, C)$]

idea: use the reparametrization in BP

$$p(A, B, C, D, E, F) \propto \frac{\hat{p}(A, B) \ldots \hat{p}(C, A)}{\hat{p}(A) \ldots \hat{p}(F)}$$

- numerator: product of clique marginals
- denominator: cancel the double-counts

set the factors using empirical marginals, e.g.,

$$\phi(A, B) \leftarrow p_D(A, B) / p_D(A)$$

each term in the numerator & denominator of the reparametrization should be used exactly once

if we run BP on the resulting model we will have

$$p_D(A, B) = \hat{p}(A, B; \theta), \quad p_D(B, D) = \hat{p}(B, D; \theta), \; \ldots$$
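
The construction can be checked by enumeration in a setting where BP is exact. The sketch below swaps the loopy graph of the slide for a three-variable chain A - B - C (a tree), where the reparametrization is $p(A, B) \, p(B, C) / p(B)$; the counts are a hypothetical dataset, and on a loopy graph the same check would need an actual BP implementation.

```python
import itertools


def marginal(joint, keep):
    """Marginalize a dict {state tuple: prob} onto the index tuple `keep`."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


# hypothetical counts over (A, B, C)
counts = {(0, 0, 0): 3, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 4,
          (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 2, (1, 1, 1): 5}
n = sum(counts.values())
p_D = {s: c / n for s, c in counts.items()}

p_AB = marginal(p_D, (0, 1))
p_BC = marginal(p_D, (1, 2))
p_B = marginal(p_D, (1,))

# Set the factors from empirical marginals; p_D(B) is used exactly once.
factor_AB = dict(p_AB)
factor_BC = {(b, c): p_BC[(b, c)] / p_B[(b,)] for (b, c) in p_BC}

model = {(a, b, c): factor_AB[(a, b)] * factor_BC[(b, c)]
         for a, b, c in itertools.product([0, 1], repeat=3)}

print(marginal(model, (0, 1)))
print(p_AB)
```

On the chain, the pairwise marginals of the resulting model equal the empirical ones without any iterative optimization, which is the property pseudo-moment matching aims for at a BP fixed point on loopy graphs.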

slide-42
SLIDE 42

Pseudo-likelihood

log-likelihood, using the chain rule:

$$\log p(D; \theta) = \sum_{x \in D} \sum_i \log p(x_i \mid x_1, \ldots, x_{i-1}; \theta)$$

pseudo log-likelihood is an approximation:

$$\log p(D; \theta) \approx \sum_{x \in D} \sum_i \log p(x_i \mid x_{-i}; \theta), \quad x_{-i} = [x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$$

each conditional eliminates the normalization constant:

$$p(x_i \mid x_{-i}; \theta) = \frac{p(x; \theta)}{\sum_{x_i} p(x; \theta)} = \frac{\tilde{p}(x; \theta)}{\sum_{x_i} \tilde{p}(x; \theta)}$$

it simplifies the gradient: instead of calculating

$$\sum_{x \in D} \phi_k(x) - |D| \, \mathbb{E}_{p_\theta}[\phi_k(x)] \quad \text{(expensive!)}$$

use

$$\sum_{x \in D} \sum_i \big( \phi_k(x) - \mathbb{E}_{p(\cdot \mid x_{-i})}[\phi_k(x_i', x_{-i})] \big)$$

- upshot: only conditional expectations are used (tractable!)
- can be further simplified using Markov blanket for each node...
- at the limit of large data (assuming we have the right model), this is exact!
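
A minimal sketch of the pseudo log-likelihood, assuming the three-variable example from earlier in the deck (binary A, B, C with features $I(A=1, B=1)$ and $I(B=1, C=1)$); function names are illustrative. Each conditional is normalized over the two values of one variable only, so $Z(\theta)$ never appears.

```python
import math


def log_p_tilde(theta, x):
    """Unnormalized log-probability <theta, phi(x)> of the toy model."""
    a, b, c = x
    return theta[0] * (a == 1 and b == 1) + theta[1] * (b == 1 and c == 1)


def log_conditional(theta, x, i):
    """log p(x_i | x_{-i}; theta); the sum runs over the values of x_i only."""
    scores = []
    for value in (0, 1):
        x_alt = list(x)
        x_alt[i] = value
        scores.append(log_p_tilde(theta, tuple(x_alt)))
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_p_tilde(theta, x) - log_norm


def pseudo_log_likelihood(theta, data):
    """Sum over datapoints and variables of log p(x_i | x_{-i}; theta)."""
    return sum(log_conditional(theta, x, i) for x in data for i in range(len(x)))


# hypothetical observations of (a, b, c)
data = [(1, 1, 0), (0, 1, 1)]
print(pseudo_log_likelihood([0.8, -0.4], data))
```

The cost per datapoint is linear in the number of variables times the number of values per variable, instead of the exponential sum needed for log Z.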

slide-44
SLIDE 44

Contrastive methods

log-likelihood:

$$\log p(D; \theta) = \sum_{x \in D} \big( \log \tilde{p}(x; \theta) - \log Z(\theta) \big)$$

- increase the unnormalized prob. of the data, $\log \tilde{p}(x; \theta)$; it's easy to evaluate: e.g., $\langle \theta, \phi(x) \rangle$
- keep the total sum of unnormalized probabilities, $\log \sum_x \tilde{p}(x; \theta)$, small; sum over exponentially many terms

contrastive methods: replace $\log Z(\theta)$ with a tractable alternative

- contrastive divergence minimization: only look at a small "neighborhood" of the data
- margin-based training: consider $\log \max_{x' \neq x} \tilde{p}(x'; \theta)$; only for conditional training
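
To make the margin-based replacement concrete, the sketch below computes both the conditional log-likelihood term and its margin counterpart for a small conditional model. The three-label setup and the per-label weight layout of `theta` are assumptions for illustration, not details from the lecture.

```python
import math

LABELS = (0, 1, 2)


def score(theta, x, y):
    """log p~(y | x; theta) = <theta, phi(x, y)>; theta[y] holds the weights for label y."""
    return sum(w * xi for w, xi in zip(theta[y], x))


def log_max_competitor(theta, x, y):
    """log max over y' != y of the unnormalized probability (replaces log Z)."""
    return max(score(theta, x, y_alt) for y_alt in LABELS if y_alt != y)


def margin_objective(theta, data):
    """Sum of score of the observed label minus the best competing score."""
    return sum(score(theta, x, y) - log_max_competitor(theta, x, y) for x, y in data)


def conditional_log_likelihood(theta, data):
    """Exact objective, with the log partition function over all labels."""
    total = 0.0
    for x, y in data:
        log_z = math.log(sum(math.exp(score(theta, x, y_alt)) for y_alt in LABELS))
        total += score(theta, x, y) - log_z
    return total


# hypothetical weights and data
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
data = [((2.0, 1.0), 0)]
print(margin_objective(theta, data), conditional_log_likelihood(theta, data))
```

With structured outputs, the maximization over competitors is a MAP inference problem, which is often cheaper than the sum over all outputs needed for log Z.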

slide-46
SLIDE 46

Structure Learning

Conditional independence test

X − Y ⇒ X ⊥ Y ∣ MB(Y ) ∨ X ⊥ Y ∣ MB(X)

- similar to finding the undirected skeleton of a Bayes net
- bound on the size of the Markov blanket (versus #parents in the BN)

Maximizing a score:

- likelihood score
- Bayesian score (approx. BIC)
- these scores do not decompose; learn models with low tree-width
- MAP score (L1 regularized log-likelihood): convex problem; introduce features 1-by-1 until convergence

slide-49
SLIDE 49

Summary

- parameter learning in MRFs is difficult
  - normalization constant ties the parameters together
  - likelihood does not decompose
  - Bayesian inference is also difficult
- (conditional) log-likelihood is concave, so maximizing it is a convex optimization problem
  - gradient steps: need inference on the current model
  - the global optimum satisfies the moment-matching condition
  - combine inference methods + gradient descent for learning
- alternative approaches: pseudo-moment matching, pseudo-likelihood, contrastive divergence, margin-based training