Probabilistic Graphical Models (CMSC 691, UMBC) - PowerPoint PPT Presentation



SLIDE 1

Probabilistic Graphical Models

CMSC 691 UMBC

SLIDE 2

Two Problems for Graphical Models

1. Finding the normalizer
2. Computing the marginals

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 3

Two Problems for Graphical Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 4

Two Problems for Graphical Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals: $Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$

Sum over all variable combinations, with the $x_n$ coordinate fixed.

Example: 3 variables, fix the 2nd dimension: $Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c(x = (x_1, v, x_3))$

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 5

Two Problems for Graphical Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals: $Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$

Q: Why are these difficult?
A: Many different combinations

Sum over all variable combinations, with the $x_n$ coordinate fixed.

Example: 3 variables, fix the 2nd dimension: $Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c(x = (x_1, v, x_3))$

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 6

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \dots, X_N$

SLIDE 7

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \dots, X_N$

Graph: G = (vertices V, edges E)
Distribution: $p(X_1, \dots, X_N)$

SLIDE 8

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \dots, X_N$

Graph: G = (vertices V, edges E)
Distribution: $p(X_1, \dots, X_N)$
Vertices ↔ random variables
Edges show dependencies among random variables

SLIDE 9

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \dots, X_N$

Graph: G = (vertices V, edges E)
Distribution: $p(X_1, \dots, X_N)$
Vertices ↔ random variables
Edges show dependencies among random variables

Two main flavors: directed graphical models and undirected graphical models

SLIDE 10

Outline

Directed Graphical Models
Undirected Graphical Models
Factor Graphs

SLIDE 11

Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables $X_1, \dots, X_N$. The joint probability factorizes into factors of $X_j$ conditioned on the parents of $X_j$.

SLIDE 12

Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables $X_1, \dots, X_N$. The joint probability factorizes into factors of $X_j$ conditioned on the parents of $X_j$.

Benefit: the independence properties are transparent (they can be read directly off the graph)

SLIDE 13

Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables $X_1, \dots, X_N$. The joint probability factorizes into factors of $X_j$ conditioned on the parents of $X_j$.

A graph/joint distribution that follows this is a Bayesian network

SLIDE 14

Bayesian Networks: Directed Acyclic Graphs

[DAG over $x_1, x_2, x_3, x_4, x_5$]

$p(x_1, x_2, x_3, \dots, x_N) = \prod_j p(x_j \mid \pi(x_j))$

$\pi(x_j)$: the "parents of" $x_j$ (the product follows a topological sort)

SLIDE 15

Bayesian Networks: Directed Acyclic Graphs

[DAG over $x_1, x_2, x_3, x_4, x_5$]

$p(x_1, x_2, x_3, \dots, x_N) = \prod_j p(x_j \mid \pi(x_j))$

$p(x_1, x_2, x_3, x_4, x_5) = \;???$

SLIDE 16

Bayesian Networks: Directed Acyclic Graphs

[DAG over $x_1, x_2, x_3, x_4, x_5$]

$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_3)\, p(x_2 \mid x_1, x_3)\, p(x_4 \mid x_2, x_3)\, p(x_5 \mid x_2, x_4)$

SLIDE 17

Bayesian Networks: Directed Acyclic Graphs

[DAG over $x_1, x_2, x_3, x_4, x_5$]

$p(x_1, x_2, x_3, \dots, x_N) = \prod_j p(x_j \mid \pi(x_j))$

Exact inference in general DAGs is NP-hard; inference in trees can be exact.

SLIDE 18

Directed Graphical Model Notation

[DAG over $x_1, x_2, x_3, x_4, x_5$]

Shaded nodes are observed R.V.s
Unshaded nodes are unobserved (latent) R.V.s

SLIDE 19

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:
1. P has a chain with an observed middle node
2. P has a fork with an observed parent node
3. P includes a "v-structure" or "collider" with all unobserved descendants

[diagrams: chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y]

SLIDE 20

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:
1. P has a chain with an observed middle node (observing Z blocks the path from X to Y)
2. P has a fork with an observed parent node (observing Z blocks the path from X to Y)
3. P includes a "v-structure" or "collider" with all unobserved descendants (not observing Z blocks the path from X to Y)

[diagrams: chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y]

SLIDE 21

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:
1. P has a chain with an observed middle node (observing Z blocks the path from X to Y)
2. P has a fork with an observed parent node (observing Z blocks the path from X to Y)
3. P includes a "v-structure" or "collider" with all unobserved descendants (not observing Z blocks the path from X to Y)

[diagrams: chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y]

For the collider: $p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y)$, so
$p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$

SLIDE 22

Markov Blanket

The Markov blanket of a node $x$ is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable $x_j$.

$p(x_j \mid x_{k \ne j}) = \frac{p(x_1, \dots, x_N)}{\int p(x_1, \dots, x_N)\, dx_j} = \frac{\prod_l p(x_l \mid \pi(x_l))}{\int \prod_l p(x_l \mid \pi(x_l))\, dx_j}$

Factor out the terms not dependent on $x_j$ (using the factorization of the graph):

$= \frac{\prod_{l:\, l = j \text{ or } j \in \pi(x_l)} p(x_l \mid \pi(x_l))}{\int \prod_{l:\, l = j \text{ or } j \in \pi(x_l)} p(x_l \mid \pi(x_l))\, dx_j}$

(in this example, shading does not show observed/latent)

SLIDE 23

Outline

Directed Graphical Models
Undirected Graphical Models
Factor Graphs

SLIDE 24

Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables $X_1, \dots, X_N$. The joint probability factorizes based on cliques in the graph.

SLIDE 25

Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables $X_1, \dots, X_N$. The joint probability factorizes based on cliques in the graph.

Common name: Markov Random Fields

SLIDE 26

Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables $X_1, \dots, X_N$. The joint probability factorizes based on cliques in the graph.

Common name: Markov Random Fields

Undirected graphs can have an alternative formulation as Factor Graphs

SLIDE 27

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \dots, x_N)$

SLIDE 28

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \dots, x_N)$

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 29

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

$x_C$: the variables that are part of the clique C
$C$: ranges over the maximal cliques
$Z$: global normalization
$\psi_C$: potential function (not necessarily a probability!)

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 30

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

$x_C$: the variables that are part of the clique C
$C$: ranges over the maximal cliques
$Z$: global normalization
$\psi_C$: potential function (not necessarily a probability!)

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 31

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

$x_C$: the variables that are part of the clique C
$C$: ranges over the maximal cliques
$Z$: global normalization
$\psi_C$: potential function (not necessarily a probability!)

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

Q: What restrictions should we place on the potentials $\psi_C$?

SLIDE 32

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

$x_C$: the variables that are part of the clique C
$C$: ranges over the maximal cliques
$Z$: global normalization
$\psi_C$: potential function (not necessarily a probability!)

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

Q: What restrictions should we place on the potentials $\psi_C$?
A: $\psi_C \ge 0$ (or $\psi_C > 0$)

SLIDE 33

Terminology: Potential Functions

$\psi_C(x_C) = \exp(-E(x_C))$

$E$: energy function (for clique C); this exponential form gives a Boltzmann distribution
(get the total energy of a configuration by summing the individual energy functions)

$p(x_1, x_2, x_3, \dots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 34

Ambiguity in Undirected Model Notation

[triangle graph over X, Y, Z]

$p(x, y, z) \propto \psi(x, y, z)$  or  $p(x, y, z) \propto \psi_1(x, y)\, \psi_2(y, z)\, \psi_3(x, z)$

SLIDE 35

Outline

Directed Graphical Models
Undirected Graphical Models
Factor Graphs

SLIDE 36

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \dots, X_N)$
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
Evidence nodes X are the random variables
Factor nodes F take values associated with the potential functions
Edges show what variables are used in which factors

SLIDE 37

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \dots, X_N)$
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T

[undirected graph over X, Y, Z]

SLIDE 38

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \dots, X_N)$
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
Evidence nodes X are the random variables

[undirected graph over X, Y, Z, and its factor graph]

SLIDE 39

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \dots, X_N)$
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
Evidence nodes X are the random variables
Factor nodes F take values associated with the potential functions

[undirected graph over X, Y, Z, and its factor graph]

SLIDE 40

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \dots, X_N)$
Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
Evidence nodes X are the random variables
Factor nodes F take values associated with the potential functions
Edges show what variables are used in which factors

[undirected graph over X, Y, Z, and its factor graph]

SLIDE 41

Different Factor Graph Notation for the Same Graph

[three alternative factor-graph drawings over X, Y, Z]

SLIDE 42

Directed vs. Undirected Models: Moralization

[DAG: x1, x2, x3 each pointing to x4]

SLIDE 43

Directed vs. Undirected Models: Moralization

[DAG: x1, x2, x3 each pointing to x4]

$p(x_1, \dots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$

[undirected version of the same graph]

SLIDE 44

Directed vs. Undirected Models: Moralization

[DAG: x1, x2, x3 each pointing to x4]

$p(x_1, \dots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$

[undirected version of the same graph]

Parents of nodes in a directed graph must be connected in an undirected graph
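
The "connect the parents, then drop directions" step can be sketched in a few lines. This is an illustrative sketch only (the node numbering follows the 4-node example; the `parents` dictionary is the one piece of input data):

```python
from itertools import combinations

# Parent sets for the 4-node example: x1, x2, x3 each point to x4.
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3]}

undirected = set()
for child, pas in parents.items():
    for p in pas:
        undirected.add(frozenset((p, child)))   # keep each original edge, undirected
    for a, b in combinations(pas, 2):
        undirected.add(frozenset((a, b)))       # "marry" the co-parents

# The moral graph now connects x1-x2, x1-x3, x2-x3 in addition to
# the original x1-x4, x2-x4, x3-x4 edges.
assert frozenset((1, 2)) in undirected and len(undirected) == 6
```

The married-parent edges are what allow the single undirected clique $\{x_1, x_2, x_3, x_4\}$ to hold the factor $p(x_4 \mid x_1, x_2, x_3)$.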

SLIDE 45

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative):
[directed chain $z_1 \to z_2 \to z_3 \to z_4$, each $z_j$ emitting word $w_j$]

SLIDE 46

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative):
[directed chain $z_1 \to z_2 \to z_3 \to z_4$, each $z_j$ emitting word $w_j$]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional):
[directed chain over $z_1, \dots, z_4$, with each word $w_j$ feeding into $z_j$]

SLIDE 47

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative):
[directed chain $z_1 \to z_2 \to z_3 \to z_4$, each $z_j$ emitting word $w_j$]

Undirected (e.g., conditional random field [CRF]):
[undirected chain over $z_1, \dots, z_4$, linked to words $w_1, \dots, w_4$]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional):
[directed chain over $z_1, \dots, z_4$, with each word $w_j$ feeding into $z_j$]

SLIDE 48

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative):
[directed chain $z_1 \to z_2 \to z_3 \to z_4$, each $z_j$ emitting word $w_j$]

Undirected as factor graph (e.g., conditional random field [CRF]):
[factor graph over tags $z_1, \dots, z_4$]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional):
[directed chain over $z_1, \dots, z_4$, with each word $w_j$ feeding into $z_j$]

SLIDE 49

Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging

[chain $z_1, z_2, z_3, z_4$]

President  Obama  told  Congress  …
Noun-Mod   Noun   Verb  Noun

SLIDE 50

Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging and named entity recognition

[chain $z_1, z_2, z_3, z_4$]

President  Obama  told  Congress  …
Noun-Mod   Noun   Verb  Noun

President  Obama   told   Congress  …
Person     Person  Other  Org.

SLIDE 51

Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \dots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 52

Linear Chain CRFs for Part of Speech Tagging

$p(\clubsuit \mid \diamondsuit)$

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \dots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 53

Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \dots, z_N \mid \diamondsuit)$

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \dots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 54

Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \dots, z_N \mid x_{1:N})$

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \dots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 55

Linear Chain CRFs for Part of Speech Tagging

[factor graph: tag nodes $z_1, \dots, z_4$; solo-tag factors $g_1, \dots, g_4$; inter-tag factors $h_1, \dots, h_4$]

$p(z_1, z_2, \dots, z_N \mid x_{1:N})$

SLIDE 56

Linear Chain CRFs for Part of Speech Tagging

[factor graph: tag nodes $z_1, \dots, z_4$; solo-tag factors $g_1, \dots, g_4$; inter-tag factors $h_1, \dots, h_4$]

$p(z_1, z_2, \dots, z_N \mid x_{1:N}) \propto \prod_{j=1}^{N} \exp\left( \langle \theta_g, g_j(z_j) \rangle + \langle \theta_h, h_j(z_j, z_{j+1}) \rangle \right)$

SLIDE 57

Linear Chain CRFs for Part of Speech Tagging

[factor graph: tag nodes $z_1, \dots, z_4$; solo-tag factors $g_1, \dots, g_4$; inter-tag factors $h_1, \dots, h_4$]

$h_k$: inter-tag features (can depend on any/all input words $x_{1:N}$)

SLIDE 58

Linear Chain CRFs for Part of Speech Tagging

[factor graph: tag nodes $z_1, \dots, z_4$; solo-tag factors $g_1, \dots, g_4$; inter-tag factors $h_1, \dots, h_4$]

$h_k$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$g_j$: solo tag features (can depend on any/all input words $x_{1:N}$)

SLIDE 59

Linear Chain CRFs for Part of Speech Tagging

[factor graph: tag nodes $z_1, \dots, z_4$; solo-tag factors $g_1, \dots, g_4$; inter-tag factors $h_1, \dots, h_4$]

$h_k$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$g_j$: solo tag features (can depend on any/all input words $x_{1:N}$)

Feature design, just like in maxent models!

SLIDE 60

Linear Chain CRFs for Part of Speech Tagging

[factor graph: tag nodes $z_1, \dots, z_4$; solo-tag factors $g_1, \dots, g_4$; inter-tag factors $h_1, \dots, h_4$]

$h_k$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$g_j$: solo tag features (can depend on any/all input words $x_{1:N}$)

Example:
$h_{k,\, N \to V}(z_j, z_{j+1}) = 1$ if $z_j = \text{N}$ and $z_{j+1} = \text{V}$, else 0
$h_{k,\, \text{told},\, N \to V}(z_j, z_{j+1}) = 1$ if $z_j = \text{N}$ and $z_{j+1} = \text{V}$ and $x_j = \text{told}$, else 0
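
The two example features are just binary indicator functions, which can be written directly (a minimal sketch; the tag labels "N"/"V" and the function names are illustrative):

```python
# Indicator feature: fires when tag j is a noun and tag j+1 is a verb.
def h_N_to_V(z_j, z_j1):
    return 1 if (z_j == "N" and z_j1 == "V") else 0

# Lexicalized variant: additionally requires the input word at position j
# to be "told" (features may inspect any of the input words).
def h_told_N_to_V(z_j, z_j1, x_j):
    return 1 if (z_j == "N" and z_j1 == "V" and x_j == "told") else 0

assert h_N_to_V("N", "V") == 1
assert h_told_N_to_V("N", "V", "told") == 1
assert h_told_N_to_V("N", "V", "said") == 0
```

Each such feature gets its own learned weight, exactly as in a maxent model.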

SLIDE 61

Outline

Directed Graphical Models
Undirected Graphical Models
Factor Graphs