SLIDE 1

Undirected Graphical Models: Markov Random Fields

Probabilistic Graphical Models, Sharif University of Technology, Soleymani, Spring 2018

SLIDE 2

Markov Network

- Structure: undirected graph

- Undirected edges show correlations (non-causal relationships) between variables

- e.g., spatial image analysis: the intensities of neighboring pixels are correlated

[Figure: an example Markov network over four nodes A, B, C, D]

SLIDE 3

MRF: Joint distribution

- Factor φ(X₁, …, Xₖ)
  - φ: Val(X₁, …, Xₖ) → ℝ
  - Scope: {X₁, …, Xₖ}

- The joint distribution is parametrized by a set of factors Φ = {φ₁(D₁), …, φ_K(D_K)}:

    P(X₁, …, Xₙ) = (1/Z) ∏_{k=1}^{K} φₖ(Dₖ)

    Z = Σ_X ∏_{k=1}^{K} φₖ(Dₖ)

- Dₖ: the set of variables in the k-th factor (its scope)
- Z: normalization constant, called the partition function
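
To make the factor product and the partition function concrete, here is a minimal Python sketch, assuming three binary variables and two hand-picked factors (the variable names and factor values are purely illustrative, not from the slides):

```python
from itertools import product

# Illustrative factors over binary variables A, B, C (values are made up).
# Each factor maps an assignment of its scope to a non-negative number.
phi_AB = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # scope {A, B}
phi_BC = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # scope {B, C}

def unnormalized(a, b, c):
    """Product of all factors evaluated at the assignment (A=a, B=b, C=c)."""
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Partition function Z: sum of the factor product over all joint assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def P(a, b, c):
    """Normalized joint distribution P(A, B, C) = (1/Z) * product of factors."""
    return unnormalized(a, b, c) / Z

# Sanity check: the normalized values sum to 1.
print(Z, sum(P(a, b, c) for a, b, c in product([0, 1], repeat=3)))
```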

SLIDE 4

Misconception example

- Factors show "compatibilities" between different values of the variables in their scope

- A factor is only one contribution to the overall joint distribution

[Factor tables of the misconception example omitted; see Koller & Friedman]

SLIDE 5

[Figure-only slide: no recoverable text]

SLIDE 6

Misconception example

- Some inferences: P(A, B) = …  [table of values omitted]

SLIDE 7

MRF: Gibbs distribution

- Gibbs distribution with factors Φ = {φ₁(D₁), …, φ_K(D_K)}:

    P_Φ(X₁, …, Xₙ) = (1/Z) ∏_{j=1}^{K} φⱼ(Dⱼ)

    Z = Σ_X ∏_{j=1}^{K} φⱼ(Dⱼ)

- φⱼ(Dⱼ): potential function on the clique Dⱼ
- φⱼ: local contingency function
- Dⱼ: the set of variables in the j-th clique

- The potential functions and the cliques in the graph completely determine the joint distribution.

SLIDE 8

MRF Factorization: cliques

- Clique: a subset of nodes in the graph that is fully connected (a complete subgraph)

- Maximal clique: a clique is maximal if no superset of its nodes also forms a clique

- Factors are functions of the variables in the cliques

- To reduce the number of factors, we can allow factors only for maximal cliques

[Figure: graph with edges A−B, A−C, B−C, B−D, C−D]
- Cliques: {A,B,C}, {B,C,D}, {A,B}, {A,C}, {B,C}, {B,D}, {C,D}, {A}, {B}, {C}, {D}
- Max-cliques: {A,B,C}, {B,C,D}
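
The clique structure of the example graph can be enumerated programmatically. A small sketch, assuming the networkx library and the 4-node graph shown on this slide:

```python
import networkx as nx

# The undirected graph from the slide: edges A-B, A-C, B-C, B-D, C-D.
H = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")])

# networkx enumerates the *maximal* cliques; here {A,B,C} and {B,C,D}.
max_cliques = [set(c) for c in nx.find_cliques(H)]
print("maximal cliques:", max_cliques)

# Every clique (complete subgraph) is contained in some maximal clique,
# e.g. {B, C} is a subset of both maximal cliques above.
print(any({"B", "C"} <= c for c in max_cliques))
```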

SLIDE 9

Relation between factorization and independencies

- Theorem: Let X, Y, Z be three disjoint sets of variables:

    P ⊨ (X ⊥ Y | Z) iff P(X, Y, Z) = g(X, Z) h(Y, Z) for some functions g and h

SLIDE 10

MRF Factorization and pairwise independencies

- A distribution P_Φ with Φ = {φ₁(D₁), …, φ_K(D_K)} factorizes over an MRF H if each Dₖ is a complete subgraph of H

- For the conditional independence properties to hold, variables Xᵢ and Xⱼ that are not directly connected must not appear together in the scope of any factor of a distribution belonging to the graph

SLIDE 11

MRFs: Global independencies

- Global independencies, for any disjoint sets A, B, C:

    (A ⊥ B | C) holds if all paths that connect a node in A to a node in B pass through one or more nodes in C

- A path is active given Z if no node on it is in Z

- X and Y are separated given Z if there is no active path between X and Y given Z

- Separation in the undirected graph: sep_H(X; Y | Z)

[Figure: node sets X and Y separated by the set Z]
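
Separation can be checked mechanically: delete the conditioning nodes and test whether any path from X to Y survives. A minimal sketch, assuming networkx; `separated` is a hypothetical helper written only for this illustration:

```python
import networkx as nx

def separated(H, X, Y, Z):
    """sep_H(X; Y | Z): no path from X to Y remains once the nodes in Z are removed."""
    G = H.copy()
    G.remove_nodes_from(Z)
    return not any(nx.has_path(G, x, y) for x in X for y in Y if x in G and y in G)

# 4-cycle A-B-C-D-A: A and C are separated by {B, D}, but not by {B} alone.
H = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
print(separated(H, {"A"}, {"C"}, {"B", "D"}))  # True
print(separated(H, {"A"}, {"C"}, {"B"}))       # False (path A-D-C is still active)
```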

SLIDE 12

MRF: independencies

- Determining conditional independencies in undirected models is much easier than in directed ones

- Conditioning in undirected models can only eliminate dependencies, while in directed models observations can also create new dependencies (v-structures)

SLIDE 13

MRF: global independencies

- Independencies encoded by H (found using the graph separation discussed previously):

    I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}

- If P satisfies I(H), we say that H is an I-map (independency map) of P

  - I(H) ⊆ I(P), where I(P) = {(X ⊥ Y | Z) : P ⊨ (X ⊥ Y | Z)}

SLIDE 14

Factorization & Independence

- Factorization ⇒ Independence (soundness of the separation criterion)

  - Theorem: If P factorizes over H and sep_H(X; Y | Z), then P satisfies (X ⊥ Y | Z) (i.e., H is an I-map of P)

- Independence ⇒ Factorization

  - Theorem (Hammersley–Clifford): For a positive distribution P, if P satisfies I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}, then P factorizes over H

SLIDE 15

Factorization & Independence

- Theorem: Two equivalent views of graph structure for positive distributions:

  - If P satisfies all the independencies that hold in H, then it can be represented as a product of factors over the cliques of H

  - If P factorizes over a graph H, we can read from the graph structure independencies that must hold in P

SLIDE 16

Factorization on Markov networks

- It is not as intuitive as factorization in Bayesian networks

  - The correspondence between the factors in a Gibbs distribution and the distribution P is much more indirect

  - Factors do not necessarily correspond either to probabilities or to conditional probabilities

  - The parameters (of factors) may not be intuitively understandable, making them hard to elicit from people

  - There are no constraints on the parameters of a factor

    - while both CPDs and joint distributions must satisfy certain normalization constraints

SLIDE 17

Interpretation of clique potentials

- Potentials cannot all be marginal or conditional distributions

- A positive clique potential can be viewed as a general compatibility or "goodness" measure over the values of the variables in its scope

SLIDE 18

Different factorizations

- Maximal cliques:

    P_Φ(X₁, X₂, X₃, X₄) = (1/Z) φ₁₂₃(X₁, X₂, X₃) φ₂₃₄(X₂, X₃, X₄)
    Z = Σ_{X₁,X₂,X₃,X₄} φ₁₂₃(X₁, X₂, X₃) φ₂₃₄(X₂, X₃, X₄)

- Sub-cliques:

    P_Φ′(X₁, X₂, X₃, X₄) = (1/Z) φ₁₂(X₁, X₂) φ₂₃(X₂, X₃) φ₁₃(X₁, X₃) φ₂₄(X₂, X₄) φ₃₄(X₃, X₄)
    Z = Σ_{X₁,X₂,X₃,X₄} φ₁₂(X₁, X₂) φ₂₃(X₂, X₃) φ₁₃(X₁, X₃) φ₂₄(X₂, X₄) φ₃₄(X₃, X₄)

- Canonical representation:

    P_Φ″(X₁, X₂, X₃, X₄) = (1/Z) φ₁₂₃(X₁, X₂, X₃) φ₂₃₄(X₂, X₃, X₄) φ₁₂(X₁, X₂) φ₂₃(X₂, X₃) φ₁₃(X₁, X₃)
                           × φ₂₄(X₂, X₄) φ₃₄(X₃, X₄) φ₁(X₁) φ₂(X₂) φ₃(X₃) φ₄(X₄)
    Z = Σ_{X₁,X₂,X₃,X₄} φ₁₂₃(X₁, X₂, X₃) φ₂₃₄(X₂, X₃, X₄) φ₁₂(X₁, X₂) φ₂₃(X₂, X₃)
                           × φ₁₃(X₁, X₃) φ₂₄(X₂, X₄) φ₃₄(X₃, X₄) φ₁(X₁) φ₂(X₂) φ₃(X₃) φ₄(X₄)

[Figure: graph over X₁, X₂, X₃, X₄ with maximal cliques {X₁,X₂,X₃} and {X₂,X₃,X₄}]

SLIDE 19

Pairwise MRF

- All factors are over single variables or pairs of variables (Xᵢ, Xⱼ):

    P(X) = (1/Z) ∏_{(Xᵢ,Xⱼ)∈H} φᵢⱼ(Xᵢ, Xⱼ) ∏ᵢ φᵢ(Xᵢ)

- Pairwise MRFs are popular (a simple special case of general MRFs)

  - they consider only pairwise interactions, not interactions among larger subsets of variables

- Pairwise MRFs are attractive because of their simplicity, and because interactions on edges are an important special case that often arises in practice

- In general, they do not have enough parameters to encompass the whole space of joint distributions

SLIDE 20

Factor graph

- The Markov network structure does not itself fully specify the factorization of P

  - it does not generally reveal all the structure in a Gibbs parameterization

- Factor graph: two kinds of nodes

  - Variable nodes
  - Factor nodes

- The factor graph is a useful structure for inference and parametrization (as we will see)

[Figure: variable nodes X₁, X₂, X₃ connected to factor nodes f₁, f₂, f₃, f₄]

    P(X₁, X₂, X₃) ∝ f₁(X₁, X₂, X₃) f₂(X₁, X₂) f₃(X₂, X₃) f₄(X₃)

SLIDE 21

Energy function

- Constraining clique potentials to be positive could be inconvenient

- We represent a clique potential in an unconstrained form using a real-valued "energy" function

- If the potential functions are strictly positive, φ_D(X_D) > 0:

    φ_D(X_D) = exp{−E_D(X_D)}

    P(X) = (1/Z) exp{−Σ_D E_D(X_D)}

- E_D(X_D): energy function, E_D(X_D) = −ln φ_D(X_D)

SLIDE 22

Log-linear models

- Define the energy function as a linear combination of features

- A set of k features {f₁(D₁), …, f_k(D_k)} on complete subgraphs, where Dᵢ is the scope of the i-th feature:

  - The scope of a feature is a complete subgraph
  - We can have different features over the same subgraph

    P(X) = (1/Z) exp{ −Σ_{i=1}^{k} wᵢ fᵢ(Dᵢ) }
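
A log-linear model is straightforward to evaluate by brute force on a small example: compute the weighted feature sum (the energy), exponentiate its negative, and normalize. A minimal sketch with made-up features and weights over two binary variables:

```python
from itertools import product
import math

# Illustrative log-linear model over binary x1, x2 in {0, 1}.
# The features and weights below are made up for the example.
features = [
    (lambda x1, x2: float(x1 == x2), 1.5),   # agreement feature, weight w1
    (lambda x1, x2: float(x1), -0.5),        # bias feature on x1, weight w2
]

def energy(x1, x2):
    """E(x) = sum_i w_i * f_i(x); P(x) is proportional to exp(-E(x))."""
    return sum(w * f(x1, x2) for f, w in features)

Z = sum(math.exp(-energy(x1, x2)) for x1, x2 in product([0, 1], repeat=2))
P = {(x1, x2): math.exp(-energy(x1, x2)) / Z for x1, x2 in product([0, 1], repeat=2)}
print(P)   # assignments with x1 == x2 get higher probability (positive w1)
```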

SLIDE 23

Ising model

- The most likely joint configurations usually correspond to "low-energy" states

- Xᵢ ∈ {−1, 1}

- Grid model

  - Image processing, lattice physics, etc.
  - The states of adjacent nodes are related

    P(x) = (1/Z) exp{ Σᵢ uᵢxᵢ + Σ_{(i,j)∈E} wᵢⱼ xᵢxⱼ }

- The Ising model uses the features fᵢⱼ(xᵢ, xⱼ) = xᵢxⱼ
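
One way to get a feel for the grid Ising model is a tiny Gibbs sampler that repeatedly resamples each spin from its conditional given its neighbors. A rough sketch, assuming a small grid and made-up parameters u and w (not values from the slides):

```python
import math
import random

N, u, w = 5, 0.0, 0.8            # grid size and made-up node/edge parameters
x = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(N)]

def neighbors(i, j):
    """4-neighborhood of grid cell (i, j), clipped at the borders."""
    return [(a, b) for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if 0 <= a < N and 0 <= b < N]

def gibbs_sweep(x):
    """Resample every spin from P(x_ij | rest) under P(x) ~ exp(sum u*x_i + sum w*x_i*x_j)."""
    for i in range(N):
        for j in range(N):
            s = sum(x[a][b] for a, b in neighbors(i, j))
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * (u + w * s)))   # P(x_ij = +1 | neighbors)
            x[i][j] = 1 if random.random() < p_plus else -1

for _ in range(100):
    gibbs_sweep(x)
print(x)   # with w > 0, neighboring spins tend to agree
```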

SLIDE 24

Shared features in log-linear models

- In most practical models, the same feature and weight are used over many scopes

    P(x) = (1/Z) exp{ Σᵢ uᵢxᵢ + Σ_{(i,j)∈E} wᵢⱼ xᵢxⱼ },   with fᵢⱼ(xᵢ, xⱼ) = f(xᵢ, xⱼ) = xᵢxⱼ

- With shared parameters uᵢ = u and wᵢⱼ = w:

    P(x) = (1/Z) exp{ Σᵢ u xᵢ + Σ_{(i,j)∈E} w xᵢxⱼ }

SLIDE 25

Image denoising

- yᵢ ∈ {−1, 1}, i = 1, …, D: array of observed noisy pixels

- xᵢ ∈ {−1, 1}, i = 1, …, D: noise-free image

[Figure: original binary image and its noisy version; Bishop]

SLIDE 26

Image denoising

- Energy terms (one per clique type):

    E₁(xᵢ, xⱼ) = −β xᵢxⱼ     (smoothness between neighboring pixels)
    E₂(xᵢ) = −h xᵢ           (bias toward one pixel value)
    E₃(xᵢ, yᵢ) = −η xᵢyᵢ     (agreement with the observed pixel)

    P(x, y) = (1/Z) ∏ᵢ exp{−E₃(xᵢ, yᵢ)} ∏ᵢ exp{−E₂(xᵢ)} ∏_{(i,j)∈E} exp{−E₁(xᵢ, xⱼ)}

            = (1/Z) exp{ −Σᵢ E₃(xᵢ, yᵢ) − Σᵢ E₂(xᵢ) − Σ_{(i,j)∈E} E₁(xᵢ, xⱼ) }

    x̂ = argmax_x P(x | y)

- MPA: most probable assignment of the x variables given the evidence y

SLIDE 27

Image denoising

    E(x, y) = −h Σᵢ xᵢ − β Σ_{(i,j)∈E} xᵢxⱼ − η Σᵢ xᵢyᵢ

    P(x, y) = (1/Z) exp{−E(x, y)}

    x̂ = argmax_x P(x | y)

- MPA: most probable assignment of the x variables given the evidence y
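
The argmax above is intractable to compute exactly on large images; a simple approximation, used for this example in Bishop's treatment, is iterated conditional modes (ICM): sweep over the pixels and set each xᵢ to the value that lowers the energy E(x, y) with everything else fixed. A rough sketch with made-up weights h, β, η and a random stand-in for the observed image:

```python
import random

h, beta, eta = 0.0, 1.0, 2.1   # made-up weights for the bias, smoothness, and data terms
N = 8
y = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(N)]   # stand-in noisy image
x = [row[:] for row in y]                                            # initialize x with y

def local_energy(i, j, v):
    """The terms of E(x, y) that involve pixel (i, j) when x[i][j] = v."""
    nb = [(a, b) for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
          if 0 <= a < N and 0 <= b < N]
    return -h * v - beta * sum(v * x[a][b] for a, b in nb) - eta * v * y[i][j]

for _ in range(20):              # ICM: a few greedy coordinate-wise sweeps
    for i in range(N):
        for j in range(N):
            x[i][j] = min([-1, 1], key=lambda v: local_energy(i, j, v))
print(x)
```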

SLIDE 28

Image denoising (gray-scale image)

- Metric MRF:

    E(x, y) = β Σ_{(i,j)∈E} min((xᵢ − xⱼ)², d) + η Σᵢ (xᵢ − yᵢ)²

    x̂ = argmax_x (1/Z) exp{−E(x, y)}

- MPE: most probable explanation of the x variables given the evidence y

- Shared feature: fᵢⱼ(xᵢ, xⱼ) = f(xᵢ, xⱼ) = min((xᵢ − xⱼ)², d)

SLIDE 29

MRF: Pairwise and local independencies

- Pairwise independencies: (Xᵢ ⊥ Xⱼ | X − {Xᵢ, Xⱼ})

  - whenever there is no edge between Xᵢ and Xⱼ

- Markov blanket (local independencies): a variable is conditionally independent of every other variable when conditioned only on its neighboring nodes:

    (Xᵢ ⊥ X − {Xᵢ} − MB(Xᵢ) | MB(Xᵢ))

    MB(Xᵢ) = {X′ ∈ X | (Xᵢ, X′) ∈ edges}
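
In an MRF the Markov blanket of a node is simply its set of neighbors, so it can be read directly off the graph. A tiny sketch, assuming networkx and the 4-node example graph used earlier in the deck:

```python
import networkx as nx

H = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")])

def markov_blanket(H, node):
    """MB(X) in an MRF: the neighbors of X in the undirected graph."""
    return set(H.neighbors(node))

# A is independent of D given MB(A) = {B, C}; symmetrically, D ⊥ A given {B, C}.
print(markov_blanket(H, "A"))  # {'B', 'C'}
print(markov_blanket(H, "D"))  # {'B', 'C'}
```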

SLIDE 30

Relationship between local and global Markov properties

- If P ⊨ I_l(H) then P ⊨ I_p(H)

- If P ⊨ I(H) then P ⊨ I_l(H)

- Theorem: For a positive distribution P, the following three statements are equivalent:

  - P ⊨ I_p(H)  (pairwise independencies)
  - P ⊨ I_l(H)  (local independencies)
  - P ⊨ I(H)   (global independencies)

SLIDE 31

One way to construct an undirected minimal I-map of a distribution

- H is a minimal I-map for P if

  - I(H) ⊆ I(P)
  - removal of any single edge from H renders it no longer an I-map of P

- We can define H to include an edge X−Y for every pair X, Y for which:

    P ⊭ (X ⊥ Y | 𝒳 − {X, Y}),   where 𝒳 is the set of all variables

- Theorem: this H is the unique minimal I-map for the positive distribution P.

SLIDE 32

Perfect map of a distribution

- Not every distribution has an MN perfect map
- Not every distribution has a BN perfect map

[Venn diagram: probabilistic models, with directed and undirected graphical models as overlapping subsets]

SLIDE 33

A loop of at least 4 nodes without a chord has no equivalent in BNs

- Is there a BN that is a perfect map for this MN?

[Figure: the 4-cycle MN over A, B, C, D, which satisfies both (A ⊥ C | B, D) and (B ⊥ D | A, C), shown next to candidate BNs over the same nodes; no candidate captures exactly these two independencies]

SLIDE 34

A v-structure has no equivalent in MNs

- Is there an MN that is a perfect I-map of this BN?

[Figure: a v-structure BN, which satisfies (A ⊥ B) but not (A ⊥ B | C), shown next to candidate MNs over A, B, C; no candidate captures exactly this pattern]

SLIDE 35

Minimal I-map

- Since we may not find a Markov network (MN) that is a perfect map of a BN, and vice versa, we study the minimal I-map property

- H is a minimal I-map for G if

  - I(H) ⊆ I(G)
  - removal of any single edge from H renders it no longer an I-map of G

SLIDE 36

Minimal I-maps: from DAGs to MNs

- The moral graph M(G) of a DAG G is the undirected graph that contains an undirected edge between X and Y if:

  - there is a directed edge between them (in either direction), or
  - X and Y are parents of the same node

- Moralization turns a node and its parents into a fully connected subgraph

[Figure: the v-structure A → C ← B and its moral graph, in which A and B are also connected]
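
Moralization is mechanical: keep every edge of the DAG (dropping its direction) and "marry" the parents of each node. A minimal sketch, assuming networkx; `moralize` is a helper written here for illustration (networkx also provides a built-in moral_graph function in recent versions, which should give the same result):

```python
import networkx as nx
from itertools import combinations

def moralize(G):
    """Moral graph M(G) of a DAG G: undirected skeleton plus edges between co-parents."""
    M = nx.Graph()
    M.add_nodes_from(G.nodes())
    M.add_edges_from(G.edges())                      # drop edge directions
    for node in G.nodes():
        parents = list(G.predecessors(node))
        M.add_edges_from(combinations(parents, 2))   # "marry" the parents
    return M

# v-structure A -> C <- B: moralization adds the edge A - B.
G = nx.DiGraph([("A", "C"), ("B", "C")])
print(sorted(moralize(G).edges()))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```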

SLIDE 37

Minimal I-maps: from DAGs to MNs

- Theorem: The moral graph M(G) of a DAG G is a minimal I-map for G

  - The moral graph loses some independence information
  - But we cannot remove any edge from it without introducing new independencies that do not hold in G
  - All independencies in the moral graph are also satisfied in G

- Theorem: If a DAG G is moral, then its moralized graph M(G) is a perfect I-map of G.

SLIDE 38

Minimal I-maps: from MNs to DAGs

- Theorem: If G is a BN that is a minimal I-map for an MN, then G can have no immoralities.

- Corollary: If G is a minimal I-map for an MN, then G is chordal

  - Any BN that is an I-map for an MN must add triangulating edges to the graph

- An undirected graph is chordal if any loop with more than three nodes has a chord

[Figure: the 4-cycle MN over A, B, C, D and a triangulated DAG G that is a minimal I-map of it]
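
Chordality is easy to test programmatically. A small sketch, assuming networkx: the bare 4-cycle is not chordal, and adding a single chord (a triangulating edge) makes it chordal:

```python
import networkx as nx

cycle = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])   # 4-cycle, no chord
print(nx.is_chordal(cycle))            # False -> no BN is a perfect I-map for it

triangulated = cycle.copy()
triangulated.add_edge("B", "D")        # add a chord (a triangulating edge)
print(nx.is_chordal(triangulated))     # True
```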

SLIDE 39

Perfect I-map

- Theorem: Let H be a non-chordal MN. Then there is no BN that is a perfect I-map for H.

- ⇒ If the independencies of an MN can be exactly represented by a BN, then the MN graph is chordal

[Figure: the 4-cycle MN over A, B, C, D as an example of a non-chordal graph]

SLIDE 40

Perfect I-map

- Theorem: Let H be a chordal MN. Then there exists a DAG G that is a perfect I-map for H

- ⇒ The independencies in a graph can be represented by both types of models if and only if the graph is chordal

[Figure: a chordal MN over A, B, C, D, E and a DAG over the same nodes that is a perfect I-map of it]

SLIDE 41

Relationship between BNs and MNs: summary

- Directed and undirected models represent different families of independence assumptions

- Chordal graphs can be represented by both BNs and MNs

- For inference, we can use a single representation for both types of models

  - simpler design and analysis of the inference algorithm

SLIDE 42

Conditional Random Field (CRF)

- Undirected graph H with nodes X ∪ Y

  - X: observed variables
  - Y: target variables

- Consider factors Φ = {φ₁(D₁), …, φ_K(D_K)} where no scope lies entirely within X (each Dᵢ ⊈ X):

    P(Y | X) = (1/Z(X)) P̃(Y, X)

    P̃(Y, X) = ∏_{i=1}^{K} φᵢ(Dᵢ)

    Z(X) = Σ_Y P̃(Y, X)

- Nodes are connected by an edge in H whenever they appear together in the scope of some factor

SLIDE 43

CRF: discriminative model

- Models the conditional probability P(Y | X) rather than the joint probability P(Y, X)

- A CRF is based on the conditional probability of the label sequence given the observation sequence

- The probability of a transition between labels may depend on past and future observations

- Allows arbitrary dependencies among the features of the observation sequence, and we need not model them

  - as opposed to the independence assumptions made in generative models
slide-44
SLIDE 44

Naïve Markov Model

44

𝜚𝑗 𝑌𝑗, 𝑍 = exp 𝑥𝑗𝐽 𝑌𝑗 = 1, 𝑍 = 1 𝜚0 𝑍 = exp 𝑥0𝐽 𝑍 = 1 𝑄 𝑍 = 1 𝑌1, 𝑌2, … , 𝑌𝑙 = 𝜏 𝑥0 +

𝑘=1 𝑙

𝑥

𝑘𝑌 𝑘

𝑌1 𝑌2 𝑌𝑙 𝑍 … 𝑌𝑗 is binary random variable 𝑍: binary random variable

SLIDE 45

CRF: logistic model (Naïve Markov model)

    P̃(Y, X) = exp{ w₀ 1{Y = 1} + Σ_{i=1}^{n} wᵢ 1{Xᵢ = 1, Y = 1} }

    P̃(Y = 1, X) = exp{ w₀ + Σ_{i=1}^{n} wᵢXᵢ }

    P̃(Y = 0, X) = exp{0} = 1

    P(Y = 1 | X) = 1 / (1 + exp{ −(w₀ + Σ_{i=1}^{n} wᵢXᵢ) }) = σ( w₀ + Σ_{i=1}^{n} wᵢXᵢ )

- The number of parameters is linear in n
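
The reduction to logistic regression can be verified numerically: normalizing the two unnormalized values P̃(Y = 1, x) and P̃(Y = 0, x) gives exactly the sigmoid of the linear score. A tiny sketch with made-up weights:

```python
import math

w0, w = -0.5, [1.2, -0.7, 0.3]           # made-up parameters
x = [1, 0, 1]                             # a binary observation vector

score = w0 + sum(wi * xi for wi, xi in zip(w, x))
p_tilde_y1 = math.exp(score)              # unnormalized measure for Y = 1
p_tilde_y0 = 1.0                          # unnormalized measure for Y = 0 (exp of 0)

p_y1 = p_tilde_y1 / (p_tilde_y0 + p_tilde_y1)     # normalize over the two values of Y
sigmoid = 1.0 / (1.0 + math.exp(-score))          # logistic function of the linear score
print(p_y1, sigmoid)                      # the two numbers coincide
```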

SLIDE 46

Linear-chain CRF

    P(Y | X) = (1/Z(X)) P̃(Y, X)

    P̃(Y, X) = ∏_{t=1}^{T−1} φ(Y_t, Y_{t+1}) ∏_{t=1}^{T} φ(Y_t, X_t)

    Z(X) = Σ_Y P̃(Y, X)

[Figure: linear-chain CRF with the target chain Y₁ − Y₂ − … − Y_T and one observation X_t attached to each Y_t]
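
For a short chain, Z(X) and P(Y | X) can be computed by brute force over all label sequences, which makes the definitions above concrete (in practice a forward, dynamic-programming pass computes Z(X) efficiently). A minimal sketch with made-up transition and emission potentials over two labels:

```python
from itertools import product

labels = [0, 1]
T = 4
X = [0, 1, 1, 0]                                  # a made-up observation sequence

# Made-up potentials: phi_trans(y_t, y_{t+1}) and phi_emit(y_t, x_t).
phi_trans = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
phi_emit  = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnormalized(Y, X):
    """Product of transition potentials over consecutive labels and emission potentials."""
    score = 1.0
    for t in range(T - 1):
        score *= phi_trans[(Y[t], Y[t + 1])]
    for t in range(T):
        score *= phi_emit[(Y[t], X[t])]
    return score

Z_of_X = sum(unnormalized(Y, X) for Y in product(labels, repeat=T))

def P(Y, X):
    """Conditional P(Y | X) for this small chain, normalized by Z(X)."""
    return unnormalized(Y, X) / Z_of_X

best = max(product(labels, repeat=T), key=lambda Y: P(Y, X))
print(Z_of_X, best, P(best, X))
```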

SLIDE 47

CRF for sequence labeling

[Figure: two CRF structures for sequence labeling; in one, each label Y_t is connected to the entire observation sequence X_{1:T}; in the other, each label Y_t is connected only to its own observation X_t]

SLIDE 48

CRF as a discriminative model

- A discriminative approach to labeling

- A CRF does not model the distribution over the observations

- Dependencies between the observed variables may be quite complex or poorly understood, but we do not need to worry about modeling them

- When labeling a position, future observations are also taken into account

[Figure: label chains Y₁, Y₂, …, Y_T over per-position observations and over the whole observation sequence]

SLIDE 49

CRF: Named Entity Recognition

- Features: the word is capitalized, the word appears in an atlas of locations, the previous word is "Mrs", …

- Factors: φ(Y_t, Y_{t+1}) and φ(Y_t, X₁, …, X_T)

[Figure: linear-chain CRF for named entity recognition; Koller & Friedman]

SLIDE 50

CRF: Image segmentation example

- A node Yᵢ for the label of each superpixel

  - Val(Yᵢ) = {1, 2, …, K} (e.g., grass, sky, water, …)

- An edge between Yᵢ and Yⱼ whenever the corresponding superpixels share a boundary

- A node Xᵢ for the features (e.g., color, texture, location) of each superpixel

SLIDE 51

CRF: Image segmentation example

- Simple pairwise potential: φ(Yᵢ, Yⱼ) = exp{ −λ 1(Yᵢ ≠ Yⱼ) }

- More complex potentials:

  - e.g., a horse is more likely to be adjacent to vegetation than to water
  - may depend on the relative pixel locations, e.g., water below vegetation, sky above everything

SLIDE 52

CRF: Image segmentation example

[Figure: image segmentation results; Koller & Friedman]

SLIDE 53

Reference

- D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", MIT Press, 2009 [Chapter 4].