Probabilistic Graphical Models

10-708

Learning Completely Observed Undirected Graphical Models

Eric Xing

Lecture 12, Oct 19, 2005. Reading: MJ, Chapters 9, 19, 20

Recap: MLE for BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \Big( \prod_i p(x_{n,i} \mid x_{n,\pi_i}, \theta_i) \Big) = \sum_i \Big( \sum_n \log p(x_{n,i} \mid x_{n,\pi_i}, \theta_i) \Big)$$

For tabular CPDs, the MLE is just the normalized counts:

$$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$$

where $n_{ijk}$ is the number of times node $i$ takes value $k$ while its parents take configuration $j$.
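As a concrete illustration (not from the original slides), the following minimal numpy sketch computes this tabular-CPD MLE by normalizing counts for a single node with one binary parent; the synthetic data and array names are assumptions made for the example.

```python
# A toy example (an assumption, not from the lecture): one child node with a single
# binary parent. theta_jk = n_jk / sum_k' n_jk', i.e. counts normalized per parent
# configuration j.
import numpy as np

rng = np.random.default_rng(0)
parent = rng.integers(0, 2, size=1000)                        # parent configuration j
child = np.where(rng.random(1000) < 0.8, parent, 1 - parent)  # child copies parent 80% of the time

counts = np.zeros((2, 2))                                     # n_jk
np.add.at(counts, (parent, child), 1)

theta_ml = counts / counts.sum(axis=1, keepdims=True)         # MLE: normalize over k
print(theta_ml)                                               # rows approx [0.8, 0.2] and [0.2, 0.8]
```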


MLE for undirected graphical models

For directed graphical models, the log-likelihood decomposes into a sum of terms, one per family (node plus parents).

For undirected graphical models, the log-likelihood does not decompose, because the normalization constant Z is a function of all the parameters:

$$p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c), \qquad Z = \sum_{x_1, \ldots, x_n} \prod_{c \in C} \psi_c(x_c)$$

In general, we will need to do inference (i.e., marginalization) to learn the parameters of undirected models, even in the fully observed case.

Log Likelihood for UGMs with tabular clique potentials

Sufficient statistics: for a UGM (V, E), the number of times that a configuration x (i.e., X_V = x) is observed in a dataset D = {x_1, …, x_N} can be represented as follows:

$$m(x) \stackrel{\text{def}}{=} \sum_n \delta(x, x_n) \ \text{(total count)}, \qquad m_c(x_c) \stackrel{\text{def}}{=} \sum_{x_{V \setminus c}} m(x) \ \text{(clique count)}$$

In terms of the counts, the log likelihood is given by:

$$\ell = \log \prod_n p(x_n \mid \theta) = \sum_x m(x) \log p(x \mid \theta) = \sum_x m(x) \log\Big( \frac{1}{Z} \prod_c \psi_c(x_c) \Big) = \sum_c \sum_{x_c} m_c(x_c) \log \psi_c(x_c) - N \log Z$$

There is a nasty log Z term in the likelihood.


Derivative of log Likelihood

Log-likelihood:

$$\ell = \sum_c \sum_{x_c} m_c(x_c) \log \psi_c(x_c) - N \log Z$$

First term:

$$\frac{\partial \ell_1}{\partial \psi_c(x_c)} = \frac{m_c(x_c)}{\psi_c(x_c)}$$

Second term:

$$\frac{\partial \log Z}{\partial \psi_c(x_c)} = \frac{1}{Z} \frac{\partial}{\partial \psi_c(x_c)} \sum_{\tilde{x}} \prod_d \psi_d(\tilde{x}_d) = \frac{1}{Z} \sum_{\tilde{x}} \delta(\tilde{x}_c, x_c) \frac{\partial}{\partial \psi_c(\tilde{x}_c)} \prod_d \psi_d(\tilde{x}_d) = \sum_{\tilde{x}} \delta(\tilde{x}_c, x_c) \frac{1}{\psi_c(\tilde{x}_c)} \cdot \frac{1}{Z} \prod_d \psi_d(\tilde{x}_d) = \frac{1}{\psi_c(x_c)} \sum_{\tilde{x}} \delta(\tilde{x}_c, x_c)\, p(\tilde{x}) = \frac{p(x_c)}{\psi_c(x_c)}$$

(The sum ranges over all joint configurations $\tilde{x}$; the delta function sets the clique variables $\tilde{x}_c$ to the value $x_c$.)

Conditions on Clique Marginals

Derivative of the log-likelihood:

$$\frac{\partial \ell}{\partial \psi_c(x_c)} = \frac{m_c(x_c)}{\psi_c(x_c)} - N \frac{p(x_c)}{\psi_c(x_c)}$$

Hence, for the maximum likelihood parameters, we know that:

$$p_{MLE}^*(x_c) = \frac{m_c(x_c)}{N} \stackrel{\text{def}}{=} \tilde{p}(x_c)$$

In other words, at the maximum likelihood setting of the parameters, for each clique, the model marginals must be equal to the observed marginals (empirical counts).

This doesn't tell us how to get the ML parameters; it just gives us a condition that must be satisfied when we have them.


MLE for undirected graphical models

Is the graph decomposable (triangulated)?

Are all the clique potentials defined on maximal cliques (not sub-cliques)? E.g., ψ_123, ψ_234, not ψ_12, ψ_23, …

Are the clique potentials full tables (or Gaussians), or parameterized more compactly, e.g. as

$$\psi_c(x_c) = \exp\Big\{ \sum_k \theta_k f_k(x_c) \Big\}\,?$$

[Figure: two example four-node undirected graphs over X1, X2, X3, X4.]

Method      Decomposable?   Max clique?   Tabular?
Direct           √               √            √
IPF
GIS
Gradient

MLE for decomposable undirected models

Decomposable models:

  • G is decomposable ⇔ G is triangulated ⇔ G has a junction tree.
  • Potential-based representation:

$$p(x) = \frac{\prod_c \psi_c(x_c)}{\prod_s \varphi_s(x_s)}$$

Consider a chain X1 − X2 − X3. The cliques are (X1, X2) and (X2, X3); the separator is X2.

  • The empirical marginals must equal the model marginals.

Let us guess that

$$p_{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2)\, \tilde{p}(x_2, x_3)}{\tilde{p}(x_2)}$$

  • We can verify that such a guess satisfies the conditions:

$$p_{MLE}(x_1, x_2) = \sum_{x_3} p_{MLE}(x_1, x_2, x_3) = \tilde{p}(x_1, x_2) \sum_{x_3} \tilde{p}(x_3 \mid x_2) = \tilde{p}(x_1, x_2)$$

and similarly

$$p_{MLE}(x_2, x_3) = \tilde{p}(x_2, x_3)$$
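A minimal numpy sketch (not from the original slides) that checks this closed-form guess numerically: on synthetic binary data for the chain X1 − X2 − X3, the guessed joint reproduces the empirical clique marginals. The dataset and variable names are illustrative assumptions.

```python
# Synthetic check (assumptions: binary variables, random data) that the guess
# p_MLE(x1,x2,x3) = p~(x1,x2) p~(x2,x3) / p~(x2) matches the empirical clique marginals.
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 3))        # N samples of (x1, x2, x3)

p_emp = np.zeros((2, 2, 2))                      # empirical joint over (x1, x2, x3)
for x1, x2, x3 in data:
    p_emp[x1, x2, x3] += 1
p_emp /= p_emp.sum()

p12 = p_emp.sum(axis=2)                          # empirical clique marginal p~(x1, x2)
p23 = p_emp.sum(axis=0)                          # empirical clique marginal p~(x2, x3)
p2 = p_emp.sum(axis=(0, 2))                      # empirical separator marginal p~(x2)

p_mle = p12[:, :, None] * p23[None, :, :] / p2[None, :, None]   # the guessed joint

assert np.allclose(p_mle.sum(axis=2), p12)       # model marginal over (x1, x2) matches
assert np.allclose(p_mle.sum(axis=0), p23)       # model marginal over (x2, x3) matches
print("closed-form guess reproduces the empirical clique marginals")
```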

slide-5
SLIDE 5

5

MLE for decomposable undirected models (cont.)

Let us guess that

$$p_{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2)\, \tilde{p}(x_2, x_3)}{\tilde{p}(x_2)}$$

To compute the clique potentials, just equate them to the empirical marginals (or conditionals); i.e., the separator must be divided into one of its neighbors. Then Z = 1:

$$\psi_{12}^{MLE}(x_1, x_2) = \tilde{p}(x_1, x_2), \qquad \psi_{23}^{MLE}(x_2, x_3) = \frac{\tilde{p}(x_2, x_3)}{\tilde{p}(x_2)} = \tilde{p}(x_3 \mid x_2)$$

One more example: a decomposable graph on X1, X2, X3, X4 with maximal cliques {X1, X2, X3} and {X2, X3, X4} and separator {X2, X3}:

$$p_{MLE}(x_1, x_2, x_3, x_4) = \frac{\tilde{p}(x_1, x_2, x_3)\, \tilde{p}(x_2, x_3, x_4)}{\tilde{p}(x_2, x_3)}$$

$$\psi_{123}^{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2, x_3)}{\tilde{p}(x_2, x_3)} = \tilde{p}(x_1 \mid x_2, x_3), \qquad \psi_{234}^{MLE}(x_2, x_3, x_4) = \tilde{p}(x_2, x_3, x_4)$$

[Figure: the four-node decomposable graph over X1, X2, X3, X4.]

Non-decomposable and/or with non-maximal clique potentials

If the graph is non-decomposable, and/or the potentials are defined on non-maximal cliques (e.g., ψ_12, ψ_34), we cannot equate empirical marginals (or conditionals) to the MLE of the clique potentials.

[Figure: two example four-node graphs over X1, X2, X3, X4 with pairwise potentials.]

$$p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \prod_{\{i,j\}} \psi_{ij}(x_i, x_j)$$

$$\exists\, (i, j) \ \text{s.t.}\ \psi_{ij}^{MLE}(x_i, x_j) \ne \begin{cases} \tilde{p}(x_i, x_j) \\ \tilde{p}(x_i, x_j) / \tilde{p}(x_i) \\ \tilde{p}(x_i, x_j) / \tilde{p}(x_j) \end{cases}$$

Homework!
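The following brute-force numpy sketch (not from the original slides, and not a substitute for working the homework) illustrates the claim numerically: on a four-cycle with pairwise potentials, plugging the empirical pairwise marginals in as potentials produces model marginals that no longer match those empirical marginals, so that choice cannot satisfy the MLE condition. The correlated synthetic data is an assumption.

```python
# Assumptions: binary variables and synthetic correlated data; the graph is the
# four-cycle X1 - X2 - X3 - X4 - X1 with one pairwise potential per edge.
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 5000
x0 = rng.integers(0, 2, N)
copy = lambda a: np.where(rng.random(N) < 0.9, a, 1 - a)   # copy previous var with prob 0.9
x1 = copy(x0)
x2 = copy(x1)
x3 = copy(x2)
data = np.stack([x0, x1, x2, x3], axis=1)

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]                   # the 4-cycle (sorted index pairs)

p_emp = np.zeros((2,) * 4)                                  # empirical joint
for x in data:
    p_emp[tuple(x)] += 1
p_emp /= p_emp.sum()
pair = {e: p_emp.sum(axis=tuple(k for k in range(4) if k not in e)) for e in edges}

# Try psi_ij = empirical pairwise marginal and see what marginals the model produces
p_model = np.ones((2,) * 4)
for x in itertools.product(range(2), repeat=4):
    for (i, j) in edges:
        p_model[x] *= pair[(i, j)][x[i], x[j]]
p_model /= p_model.sum()                                    # normalize by Z

for e in edges:
    model_e = p_model.sum(axis=tuple(k for k in range(4) if k not in e))
    print(e, "max marginal mismatch:", float(np.abs(model_e - pair[e]).max()))
# The mismatches are clearly nonzero, so these potentials violate the MLE condition
# that model clique marginals equal empirical clique marginals.
```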

slide-6
SLIDE 6

6

Iterative Proportional Fitting (IPF)

From the derivative of the likelihood,

$$\frac{\partial \ell}{\partial \psi_c(x_c)} = \frac{m_c(x_c)}{\psi_c(x_c)} - N \frac{p(x_c)}{\psi_c(x_c)},$$

we can derive another relationship:

$$\frac{\tilde{p}(x_c)}{\psi_c(x_c)} = \frac{p(x_c)}{\psi_c(x_c)},$$

in which ψc appears implicitly in the model marginal p(xc). This is therefore a fixed-point equation for ψc.

  • Solving for ψc in closed form is hard, because it appears on both sides of this implicit nonlinear equation.

The idea of IPF is to hold ψc fixed on the right-hand side (both in the numerator and the denominator) and solve for it on the left-hand side. We cycle through all cliques, then iterate:

$$\psi_c^{(t+1)}(x_c) = \psi_c^{(t)}(x_c)\, \frac{\tilde{p}(x_c)}{p^{(t)}(x_c)}$$

Computing the model marginal $p^{(t)}(x_c)$ requires inference at every step.
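Below is a minimal brute-force sketch of IPF (not from the original slides), for a small tabular model where the model marginals can be computed by exhaustive enumeration; the toy clique structure and synthetic data are assumptions.

```python
# Assumptions: a toy 3-node cycle with binary variables, tabular pairwise potentials,
# and synthetic data; the model marginal p(x_c) is computed by brute-force enumeration.
import itertools
import numpy as np

K = 2                                      # each variable is binary
cliques = [(0, 1), (1, 2), (0, 2)]         # pairwise cliques of a 3-node cycle

def joint(psis):
    """Normalized joint distribution defined by the tabular clique potentials."""
    p = np.ones((K,) * 3)
    for x in itertools.product(range(K), repeat=3):
        for c, psi in zip(cliques, psis):
            p[x] *= psi[x[c[0]], x[c[1]]]
    return p / p.sum()

def marginal(p, c):
    """Marginalize a joint table down to the variables in clique c."""
    return p.sum(axis=tuple(v for v in range(p.ndim) if v not in c))

rng = np.random.default_rng(0)
data = rng.integers(0, K, size=(1000, 3))
p_emp = np.zeros((K,) * 3)                 # empirical distribution p~(x)
for x in data:
    p_emp[tuple(x)] += 1
p_emp /= p_emp.sum()

psis = [np.ones((K, K)) for _ in cliques]  # initialize all potentials to 1
for sweep in range(100):
    for i, c in enumerate(cliques):
        p_model = joint(psis)                                            # inference step
        psis[i] = psis[i] * marginal(p_emp, c) / marginal(p_model, c)    # IPF update

p_final = joint(psis)
for c in cliques:
    print(c, float(np.abs(marginal(p_final, c) - marginal(p_emp, c)).max()))
# After enough sweeps each clique marginal of the model matches the empirical one.
```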

Properties of IPF Updates

IPF iterates a set of fixed-point equations. However, we can prove it is also a coordinate ascent algorithm (the coordinates are the parameters of the clique potentials). Hence at each step it will increase the log-likelihood, and it will converge to a global maximum.

IPF can also be viewed as an I-projection: finding the distribution with the correct marginals that has maximal entropy.


KL Divergence View

IPF can be seen as coordinate ascent on the likelihood, using the expression of the likelihood in terms of KL divergences.

Recall that we have shown that maximizing the log likelihood is equivalent to minimizing the KL divergence (cross entropy) from the observed distribution to the model distribution:

$$\max_\theta \ell \;\Leftrightarrow\; \min_\theta \mathrm{KL}\big(\tilde{p}(x) \,\|\, p(x \mid \theta)\big) = \sum_x \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x \mid \theta)}$$

Using a property of KL divergence based on the conditional chain rule $p(x) = p(x_a)\, p(x_b \mid x_a)$:

$$\mathrm{KL}\big(q(x_a, x_b) \,\|\, p(x_a, x_b)\big) = \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_a)\, q(x_b \mid x_a)}{p(x_a)\, p(x_b \mid x_a)} = \sum_{x_a} q(x_a) \log \frac{q(x_a)}{p(x_a)} + \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_b \mid x_a)}{p(x_b \mid x_a)} = \mathrm{KL}\big(q(x_a) \,\|\, p(x_a)\big) + \sum_{x_a} q(x_a)\, \mathrm{KL}\big(q(x_b \mid x_a) \,\|\, p(x_b \mid x_a)\big)$$
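A small numerical check (not from the original slides) of this chain-rule decomposition on two arbitrary joint distributions over a pair of binary variables:

```python
# Verify KL(q(xa,xb)||p(xa,xb)) = KL(q(xa)||p(xa)) + sum_a q(xa) KL(q(xb|xa)||p(xb|xa))
# on randomly generated distributions (an illustrative assumption, not lecture code).
import numpy as np

rng = np.random.default_rng(0)
q = rng.random((2, 2)); q /= q.sum()              # q(xa, xb)
p = rng.random((2, 2)); p /= p.sum()              # p(xa, xb)

kl = lambda a, b: np.sum(a * np.log(a / b))

qa, pa = q.sum(1), p.sum(1)                        # marginals over xa
q_b_a, p_b_a = q / qa[:, None], p / pa[:, None]    # conditionals q(xb|xa), p(xb|xa)

lhs = kl(q, p)
rhs = kl(qa, pa) + sum(qa[a] * kl(q_b_a[a], p_b_a[a]) for a in range(2))
print(lhs, rhs)                                    # the two sides agree
```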

IPF minimizes KL divergence

Putting things together, we have

$$\mathrm{KL}\big(\tilde{p}(x) \,\|\, p(x \mid \theta)\big) = \mathrm{KL}\big(\tilde{p}(x_c) \,\|\, p(x_c \mid \theta)\big) + \sum_{x_c} \tilde{p}(x_c)\, \mathrm{KL}\big(\tilde{p}(x_{-c} \mid x_c) \,\|\, p(x_{-c} \mid x_c, \theta)\big)$$

It can be shown that changing the clique potential ψc has no effect on the conditional distribution p(x−c | xc), so the second term is unaffected.

To minimize the first term, we set the model marginal to the observed marginal, just as in IPF.

We can interpret IPF updates as retaining the "old" conditional probabilities p(t)(x−c | xc) while replacing the "old" marginal probability p(t)(xc) with the observed marginal $\tilde{p}(x_c)$.


Feature-based Clique Potentials

So far we have discussed the most general form of an undirected graphical model, in which cliques are parameterized by general potential functions ψc(xc).

But for large cliques these general potentials are exponentially costly for inference and have exponential numbers of parameters that we must learn from limited data.

One solution: change the graphical model to make the cliques smaller. But this changes the dependencies, and may force us to make more independence assumptions than we would like.

Another solution: keep the same graphical model, but use a less general parameterization of the clique potentials. This is the idea behind feature-based models.

Features

Consider a clique xc of random variables in a UGM, e.g. three consecutive characters c1 c2 c3 in a string of English text.

How would we build a model of p(c1, c2, c3)?

  • If we use a single clique function over c1 c2 c3, the full joint clique potential would be huge: 26³ − 1 parameters.
  • However, we often know that some particular joint settings of the variables in a clique are quite likely or quite unlikely, e.g. ing, ate, ion, ?ed, qu?, jkx, zzz, …

A "feature" is a function which is vacuous over all joint settings except a few particular ones on which it is high or low.

  • For example, we might have f_ing(c1, c2, c3) which is 1 if the string is 'ing' and 0 otherwise, and similar features for '?ed', etc.

We can also define features when the inputs are continuous. Then the idea of a cell on which the feature is active disappears, but we might still have a compact parameterization of the feature.


Features as Micropotentials

By exponentiating them, each feature function can be made into a "micropotential". We can multiply these micropotentials together to get a clique potential.

Example: a clique potential ψ(c1, c2, c3) could be expressed as:

$$\psi(c_1, c_2, c_3) = e^{\theta_{ing} f_{ing}(c_1, c_2, c_3)} \times e^{\theta_{?ed} f_{?ed}(c_1, c_2, c_3)} \times \cdots = \exp\Big\{ \sum_{k=1}^{K} \theta_k f_k(c_1, c_2, c_3) \Big\}$$

This is still a potential over 26³ possible settings, but it only uses K parameters if there are K features.

  • By having one indicator function per combination of x_c, we recover the standard tabular potential.
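A tiny sketch (not from the original slides) of such a feature-based clique potential over three characters; the particular features and weights are made-up assumptions.

```python
# psi(c1,c2,c3) = exp(sum_k theta_k f_k(c1,c2,c3)): each exponentiated feature acts as a
# "micropotential" and the products combine into one clique potential.
import math

features = {
    "ing": lambda s: 1.0 if s == "ing" else 0.0,        # indicator feature
    "?ed": lambda s: 1.0 if s.endswith("ed") else 0.0,  # matches any '?ed'
    "zzz": lambda s: 1.0 if s == "zzz" else 0.0,
}
theta = {"ing": 2.0, "?ed": 1.5, "zzz": -3.0}           # feature weights (assumed values)

def psi(c1, c2, c3):
    """Clique potential: product of exponentiated micropotentials."""
    s = c1 + c2 + c3
    return math.exp(sum(theta[k] * f(s) for k, f in features.items()))

print(psi("i", "n", "g"), psi("b", "e", "d"), psi("z", "z", "z"), psi("c", "a", "t"))
# 'ing' and '?ed' strings get boosted, 'zzz' is suppressed, neutral strings get psi = 1
```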

Combining Features

Each feature has a weight θ_k which represents the numerical strength of the feature and whether it increases or decreases the probability of the clique.

How can we combine features into a probability model? The marginal over the clique is a generalized exponential family distribution, actually a GLIM:

$$p(c_1, c_2, c_3) \propto \exp\big\{ \theta_{ing} f_{ing}(c_1, c_2, c_3) + \theta_{?ed} f_{?ed}(c_1, c_2, c_3) + \theta_{qu?} f_{qu?}(c_1, c_2, c_3) + \theta_{zzz} f_{zzz}(c_1, c_2, c_3) + \cdots \big\}$$

In general, the features may be overlapping, unconstrained indicators or any function of any subset of the clique variables:

$$\psi_c(x_c) \stackrel{\text{def}}{=} \exp\Big\{ \sum_{i \in I_c} \theta_i f_i(x_{c_i}) \Big\}$$


Feature Based Model

We can multiply these clique potentials as usual:

$$p(x) = \frac{1}{Z(\theta)} \prod_c \psi_c(x_c) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_c \sum_{i \in I_c} \theta_i f_i(x_{c_i}) \Big\}$$

However, in general we can forget about associating features with cliques and just use a simplified form:

$$p(x) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_i \theta_i f_i(x_{c_i}) \Big\}$$

This is just our friend the exponential family model, with the features as sufficient statistics!

Learning: recall that in IPF we have

$$\psi_c^{(t+1)}(x_c) = \psi_c^{(t)}(x_c)\, \frac{\tilde{p}(x_c)}{p^{(t)}(x_c)}$$

  • It is not obvious how to update the weights and features individually.

MLE of Feature Based UGMs

Scaled likelihood function:

$$\tilde{\ell}(\theta; D) = \ell(\theta; D) / N = \frac{1}{N} \sum_n \log p(x_n \mid \theta) = \sum_x \tilde{p}(x) \log p(x \mid \theta) = \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \log Z(\theta)$$

Instead of optimizing this objective directly, we attack a lower bound of it:

  • The logarithm has a linear upper bound: $\log Z(\theta) \le \mu Z(\theta) - \log \mu - 1$.
  • This bound holds for all µ, in particular for $\mu = Z(\theta^{(t)})^{-1}$.
  • Thus we have

$$\tilde{\ell}(\theta; D) \ge \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \frac{Z(\theta)}{Z(\theta^{(t)})} - \log Z(\theta^{(t)}) + 1$$


Generalized Iterative Scaling (GIS)

Lower bound of the scaled log-likelihood:

$$\tilde{\ell}(\theta; D) \ge \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \frac{Z(\theta)}{Z(\theta^{(t)})} - \log Z(\theta^{(t)}) + 1$$

Define

$$\Delta\theta_i^{(t)} \stackrel{\text{def}}{=} \theta_i - \theta_i^{(t)}$$

Relax again:

  • Assume $f_i(x) \ge 0$ and $\sum_i f_i(x) = 1$.
  • Convexity of the exponential: $\exp\big\{ \sum_i \pi_i x_i \big\} \le \sum_i \pi_i \exp\{x_i\}$ for $\pi_i \ge 0$, $\sum_i \pi_i = 1$.

We have:

$$\tilde{\ell}(\theta; D) \ge \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \frac{1}{Z(\theta^{(t)})} \sum_x \exp\Big\{ \sum_i \theta_i f_i(x) \Big\} - \log Z(\theta^{(t)}) + 1$$
$$= \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \frac{1}{Z(\theta^{(t)})} \sum_x \exp\Big\{ \sum_i \theta_i^{(t)} f_i(x) \Big\} \exp\Big\{ \sum_i \Delta\theta_i^{(t)} f_i(x) \Big\} - \log Z(\theta^{(t)}) + 1$$
$$= \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \sum_x p(x \mid \theta^{(t)}) \exp\Big\{ \sum_i \Delta\theta_i^{(t)} f_i(x) \Big\} - \log Z(\theta^{(t)}) + 1$$

Applying the convexity bound with $\pi_i = f_i(x)$:

$$\tilde{\ell}(\theta; D) \ge \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \sum_x p(x \mid \theta^{(t)}) \sum_i f_i(x)\, e^{\Delta\theta_i^{(t)}} - \log Z(\theta^{(t)}) + 1 \stackrel{\text{def}}{=} \Lambda(\theta)$$

GIS

Lower bound of the scaled log-likelihood:

$$\tilde{\ell}(\theta; D) \ge \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \sum_x p(x \mid \theta^{(t)}) \sum_i f_i(x)\, e^{\Delta\theta_i^{(t)}} - \log Z(\theta^{(t)}) + 1 \stackrel{\text{def}}{=} \Lambda(\theta)$$

Take the derivative:

$$\frac{\partial \Lambda}{\partial \theta_i} = \sum_x \tilde{p}(x) f_i(x) - e^{\Delta\theta_i^{(t)}} \sum_x p(x \mid \theta^{(t)}) f_i(x)$$

Set it to zero:

$$e^{\Delta\theta_i^{(t)}} = \frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p(x \mid \theta^{(t)}) f_i(x)} = Z(\theta^{(t)})\, \frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)}$$

  • where $p^{(t)}(x)$ is the unnormalized version of $p(x \mid \theta^{(t)})$.

Update:

$$\theta_i^{(t+1)} = \theta_i^{(t)} + \Delta\theta_i^{(t)}$$

$$\Rightarrow\; p^{(t+1)}(x) = p^{(t)}(x) \prod_i \Big( e^{\Delta\theta_i^{(t)}} \Big)^{f_i(x)} = p^{(t)}(x) \prod_i \left( Z(\theta^{(t)})\, \frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)} \right)^{f_i(x)} = p^{(t)}(x)\, Z(\theta^{(t)}) \prod_i \left( \frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)} \right)^{f_i(x)}$$

(The last step uses $\sum_i f_i(x) = 1$.)
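A minimal brute-force GIS sketch (not from the original slides) for a small feature-based model over an explicitly enumerable state space; the toy features, the slack feature used to make the features sum to one, and the synthetic data are assumptions.

```python
# Assumptions: a tiny 3-variable binary model, two made-up indicator features padded
# with a slack feature so the features are non-negative and sum to one, synthetic data,
# and brute-force enumeration of the state space.
import itertools
import numpy as np

states = list(itertools.product(range(2), repeat=3))

def raw_features(x):
    return np.array([x[0], x[1] * x[2]], dtype=float)   # two toy indicator features

C = 2.0                                                  # max possible total raw feature value
feat = lambda x: np.append(raw_features(x) / C, 1.0 - raw_features(x).sum() / C)

F = np.array([feat(x) for x in states])                  # rows sum to 1, entries >= 0

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 3))
emp = np.array([feat(x) for x in data]).mean(axis=0)     # empirical feature expectations

theta = np.zeros(F.shape[1])
for t in range(2000):
    p = np.exp(F @ theta)
    p /= p.sum()                                         # model distribution p(x | theta)
    model = F.T @ p                                      # model feature expectations
    theta += np.log(emp / model)                         # GIS: delta_theta_i = log(emp_i / model_i)

p = np.exp(F @ theta)
p /= p.sum()
print("empirical:", emp)
print("model:    ", F.T @ p)                             # the expectations match at convergence
```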


Where does the exponential form come from?

Review: maximum likelihood for the exponential family.

$$\ell(\theta; D) = \sum_x m(x) \log p(x \mid \theta) = \sum_x m(x) \Big( \sum_i \theta_i f_i(x) \Big) - N \log Z(\theta)$$

$$\frac{\partial \ell(\theta; D)}{\partial \theta_i} = \sum_x m(x) f_i(x) - N \frac{\partial \log Z(\theta)}{\partial \theta_i} = \sum_x m(x) f_i(x) - N \sum_x p(x \mid \theta) f_i(x)$$

$$\Rightarrow\; \sum_x p(x \mid \theta) f_i(x) = \frac{1}{N} \sum_x m(x) f_i(x) = \sum_x \tilde{p}(x) f_i(x)$$

I.e., at the ML estimate, the expectations of the sufficient statistics under the model must match the empirical feature averages.
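Since this gradient is just the difference between empirical and model feature expectations, plain gradient ascent is an alternative to GIS; the short sketch below (not from the original slides) illustrates this on the same kind of brute-force setup, with made-up features and data.

```python
# Assumptions: brute-force enumeration as in the GIS sketch, with made-up features,
# synthetic data, and a fixed step size.
import itertools
import numpy as np

states = list(itertools.product(range(2), repeat=3))
F = np.array([[x[0], x[1], x[0] * x[2]] for x in states], dtype=float)

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(500, 3))
emp = np.array([[x[0], x[1], x[0] * x[2]] for x in data], dtype=float).mean(axis=0)

theta = np.zeros(F.shape[1])
for t in range(5000):
    p = np.exp(F @ theta)
    p /= p.sum()
    theta += 0.5 * (emp - F.T @ p)      # gradient = empirical minus model expectations

p = np.exp(F @ theta)
p /= p.sum()
print("empirical:", emp)
print("model:    ", F.T @ p)            # moment matching at the maximum likelihood fit
```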

Maximum Entropy

We can approach the modeling problem from an entirely different point of view. Begin with some fixed feature expectations:

$$\sum_x p(x) f_i(x) = \alpha_i$$

Assuming the expectations are consistent, there may exist many distributions which satisfy them. Which one should we select?

  • The most uncertain or flexible one, i.e., the one with maximum entropy.

This yields a new optimization problem:

$$\max_{p(x)} H(p) = -\sum_x p(x) \log p(x) \quad \text{s.t.} \quad \sum_x p(x) f_i(x) = \alpha_i, \quad \sum_x p(x) = 1$$

This is a variational definition of a distribution!


Solution to the MaxEnt Problem

To solve the MaxEnt problem, we use Lagrange multipliers:

$$L = -\sum_x p(x) \log p(x) + \sum_i \theta_i \Big( \sum_x f_i(x)\, p(x) - \alpha_i \Big) + \mu \Big( \sum_x p(x) - 1 \Big)$$

$$\frac{\partial L}{\partial p(x)} = -\log p(x) - 1 + \sum_i \theta_i f_i(x) + \mu = 0$$

$$\Rightarrow\; p^*(x) = e^{\mu - 1} \exp\Big\{ \sum_i \theta_i f_i(x) \Big\} = \frac{1}{Z(\theta)} \exp\Big\{ \sum_i \theta_i f_i(x) \Big\} \qquad \Big( \text{since } \sum_x p^*(x) = 1 \Big)$$

So feature constraints + MaxEnt ⇒ exponential family. The problem is strictly convex w.r.t. p, so the solution is unique.
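A small numerical check (not from the original slides): solve the MaxEnt problem directly with a generic constrained optimizer and verify that log p*(x) is an affine function of the features, i.e. the solution lies in the exponential family. The toy features, target moments, and the use of scipy's SLSQP solver are assumptions of this sketch.

```python
# Assumptions: three binary variables, two made-up features, feasible target moments.
import itertools
import numpy as np
from scipy.optimize import minimize

states = list(itertools.product(range(2), repeat=3))
F = np.array([[x[0], x[1] * x[2]] for x in states], dtype=float)    # feature matrix
alpha = np.array([0.7, 0.2])                                        # target moments

neg_entropy = lambda p: np.sum(p * np.log(p + 1e-12))
cons = [{"type": "eq", "fun": lambda p: F.T @ p - alpha},
        {"type": "eq", "fun": lambda p: p.sum() - 1.0}]
res = minimize(neg_entropy, np.full(len(states), 1 / len(states)),
               bounds=[(1e-9, 1.0)] * len(states), constraints=cons, method="SLSQP")
p_star = res.x

# Fit log p*(x) = theta . f(x) + const by least squares; a near-zero residual means
# the MaxEnt solution lies in the exponential family defined by these features.
A = np.hstack([F, np.ones((len(states), 1))])
coef, residual, *_ = np.linalg.lstsq(A, np.log(p_star), rcond=None)
print("theta and log-normalizer:", coef)
print("residual:", residual)
```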

A more general MaxEnt problem

$$\min_{p(x)} \mathrm{KL}\big( p(x) \,\|\, h(x) \big) = \sum_x p(x) \log \frac{p(x)}{h(x)} \stackrel{\text{def}}{=} -H(p) - \sum_x p(x) \log h(x)$$
$$\text{s.t.} \quad \sum_x p(x) f_i(x) = \alpha_i, \quad \sum_x p(x) = 1$$

$$\Rightarrow\; p(x) = \frac{1}{Z(\theta)}\, h(x) \exp\Big\{ \sum_i \theta_i f_i(x) \Big\}$$


Constraints from Data

Where do the constraints α_i come from? Just as before, measure the empirical counts on the training data:

$$\alpha_i = \sum_x \tilde{p}(x) f_i(x) = \frac{1}{N} \sum_x m(x) f_i(x)$$

This also ensures consistency automatically. This is known as the "method of moments" (c.f. the law of large numbers).

We have seen a case of convex duality:

  • In one case, we assume an exponential family and show that ML implies that the model expectations must match the empirical expectations.
  • In the other case, we assume the model expectations must match the empirical feature counts and show that MaxEnt implies an exponential family distribution.
  • No duality gap ⇒ both yield the same value of the objective.

Geometric interpretation

All exponential family distributions:

$$\mathcal{E} = \Big\{ p(x) : p(x) = \frac{1}{Z(\theta)}\, h(x) \exp\Big\{ \sum_i \theta_i f_i(x) \Big\} \Big\}$$

All distributions satisfying the moment constraints:

$$\mathcal{M} = \Big\{ p(x) : \sum_x p(x) f_i(x) = \sum_x \tilde{p}(x) f_i(x) \Big\}$$

Pythagorean theorem:

$$\mathrm{KL}(q \,\|\, p) = \mathrm{KL}(q \,\|\, p_M) + \mathrm{KL}(p_M \,\|\, p)$$

$$\text{MaxEnt:} \quad \min_{q \in \mathcal{M}} \mathrm{KL}(q \,\|\, h), \qquad \mathrm{KL}(q \,\|\, h) = \mathrm{KL}(q \,\|\, p_M) + \mathrm{KL}(p_M \,\|\, h)$$

$$\text{MaxLik:} \quad \min_{p \in \mathcal{E}} \mathrm{KL}(\tilde{p} \,\|\, p), \qquad \mathrm{KL}(\tilde{p} \,\|\, p) = \mathrm{KL}(\tilde{p} \,\|\, p_M) + \mathrm{KL}(p_M \,\|\, p)$$


Conditional Random Fields

So far we have focused on maxent models for density estimation.

We can also formulate such models for classification and regression (conditional density estimation):

$$p(y \mid x) = \frac{1}{Z(x, \theta)} \exp\Big\{ \sum_c \theta_c f_c(x, y_c) \Big\}$$

The model above is like doing logistic regression on the features. Now the features can be very complex, nonlinear functions of the data.

[Figure: a conditional random field with label nodes Y1, Y2, …, Y5 and input nodes X1, …, Xn.]

Conditional Random Fields

  • Allow arbitrary dependencies on the input
  • Clique dependencies on the labels
  • Use approximate inference for general graphs

$$p(y \mid x) = \frac{1}{Z(x, \theta)} \exp\Big\{ \sum_c \theta_c f_c(x, y_c) \Big\}$$
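A minimal brute-force sketch (not from the original slides) of such a conditional model for a tiny chain of three binary labels, computing Z(x, θ) by enumerating all labelings; the features and weights are made-up assumptions, and a real CRF implementation would use forward-backward rather than enumeration.

```python
# Assumptions: three binary labels, one pairwise feature per label clique and one
# observation feature per label, with hand-picked weights.
import itertools
import numpy as np

def clique_features(x, y):
    """One feature per label-pair clique plus one observation feature per label."""
    pair = [float(y[i] == y[i + 1]) for i in range(len(y) - 1)]   # label smoothness
    obs = [float(y[i] == (x[i] > 0)) for i in range(len(y))]      # label matches sign of x_i
    return np.array(pair + obs)

theta = np.array([1.0, 1.0, 2.0, 2.0, 2.0])        # weights for the 2 + 3 features above

def p_y_given_x(x):
    ys = list(itertools.product(range(2), repeat=3))
    scores = np.array([theta @ clique_features(x, y) for y in ys])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                            # Z(x, theta) computed by enumeration
    return dict(zip(ys, probs))

x = np.array([0.8, -0.3, 1.2])                      # an observed input sequence
for y, p in sorted(p_y_given_x(x).items(), key=lambda kv: -kv[1])[:3]:
    print(y, round(float(p), 3))                    # most probable labelings given x
```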