SLIDE 1

Probabilistic Graphical Models

CMSC 678 UMBC

SLIDE 2

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \ldots, X_N$

SLIDE 3

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \ldots, X_N$

Graph G = (vertices V, edges E). Distribution $p(X_1, \ldots, X_N)$

SLIDE 4

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \ldots, X_N$

Graph G = (vertices V, edges E). Distribution $p(X_1, \ldots, X_N)$. Vertices ↔ random variables; edges show dependencies among random variables

SLIDE 5

Probabilistic Graphical Models

A graph G that represents a probability distribution over random variables $X_1, \ldots, X_N$

Graph G = (vertices V, edges E). Distribution $p(X_1, \ldots, X_N)$. Vertices ↔ random variables; edges show dependencies among random variables

Two main flavors: directed graphical models and undirected graphical models

SLIDE 6

Outline

Directed Graphical Models
- Naïve Bayes
Undirected Graphical Models
- Factor Graphs
- Ising Model
Message Passing: Graphical Model Inference

SLIDE 7

Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables $X_1, \ldots, X_N$. The joint probability factorizes into factors of $X_i$ conditioned on the parents of $X_i$.

SLIDE 8

Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables $X_1, \ldots, X_N$. The joint probability factorizes into factors of $X_i$ conditioned on the parents of $X_i$.

Benefit: the independence properties are transparent (they can be read directly off the graph)

SLIDE 9

Directed Graphical Models

A directed (acyclic) graph G = (V, E) that represents a probability distribution over random variables $X_1, \ldots, X_N$. The joint probability factorizes into factors of $X_i$ conditioned on the parents of $X_i$. A graph/joint distribution that follows this is a

Bayesian network

SLIDE 10

Bayesian Networks: Directed Acyclic Graphs

[Figure: example DAG over nodes $x_1, \ldots, x_5$]

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

$\pi(x_i)$ = “parents of” $x_i$; factor in topological-sort order

SLIDE 11

Bayesian Networks: Directed Acyclic Graphs

[Figure: example DAG over nodes $x_1, \ldots, x_5$]

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

$p(x_1, x_2, x_3, x_4, x_5) = ???$

SLIDE 12

Bayesian Networks: Directed Acyclic Graphs

[Figure: example DAG over nodes $x_1, \ldots, x_5$]

$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_3)\, p(x_2 \mid x_1, x_3)\, p(x_4 \mid x_2, x_3)\, p(x_5 \mid x_2, x_4)$

SLIDE 13

Bayesian Networks: Directed Acyclic Graphs

[Figure: example DAG over nodes $x_1, \ldots, x_5$]

$p(x_1, x_2, x_3, \ldots, x_N) = \prod_i p(x_i \mid \pi(x_i))$

Exact inference in general DAGs is NP-hard; inference in trees can be exact

SLIDE 14

Directed Graphical Model Notation

[Figure: example DAG over nodes $x_1, \ldots, x_5$]

Shaded nodes are observed R.V.s; unshaded nodes are unobserved (latent) R.V.s

SLIDE 15

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:
- P has a chain with an observed middle node
- P has a fork with an observed parent node
- P includes a “v-structure” or “collider” with all unobserved descendants

[Figure: chain X → Z → Y; fork X ← Z → Y; collider X → Z ← Y]

SLIDE 16

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:
- P has a chain with an observed middle node — observing Z blocks the path from X to Y
- P has a fork with an observed parent node — observing Z blocks the path from X to Y
- P includes a “v-structure” or “collider” with all unobserved descendants — not observing Z blocks the path from X to Y

[Figure: chain X → Z → Y; fork X ← Z → Y; collider X → Z ← Y]

SLIDE 17

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation: X & Y are d-separated if for all paths P, one of the following is true:
- P has a chain with an observed middle node — observing Z blocks the path from X to Y
- P has a fork with an observed parent node — observing Z blocks the path from X to Y
- P includes a “v-structure” or “collider” with all unobserved descendants — not observing Z blocks the path from X to Y

For the collider: $p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y)$, so

$p(x, y) = \sum_z p(x)\, p(y)\, p(z \mid x, y) = p(x)\, p(y)$

SLIDE 18

Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents

$p(x_i \mid x_{j \neq i}) = \frac{p(x_1, \ldots, x_N)}{\int p(x_1, \ldots, x_N)\, dx_i} = \frac{\prod_k p(x_k \mid \pi(x_k))}{\int \prod_k p(x_k \mid \pi(x_k))\, dx_i}$

(factorization of the graph; then factor out terms not dependent on $x_i$)

$= \frac{\prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))}{\int \prod_{k:\, k = i \text{ or } i \in \pi(x_k)} p(x_k \mid \pi(x_k))\, dx_i}$

The Markov blanket is the set of nodes needed to form the complete conditional for a variable $x_i$

(in this example, shading does not show observed/latent)
SLIDE 19

Outline

Directed Graphical Models
- Naïve Bayes
Undirected Graphical Models
- Factor Graphs
- Ising Model
Message Passing: Graphical Model Inference

SLIDE 20

Naïve Bayes

$\operatorname{argmax}_Y\, p(Y \mid X)$

Apply Bayes rule and take logs:

$\operatorname{argmax}_Y\, \log p(X \mid Y) + \log p(Y)$

(likelihood + prior)

SLIDE 21

Naïve Bayes

$\operatorname{argmax}_Y\, p(Y \mid X)$; apply Bayes rule and take logs: $\operatorname{argmax}_Y\, \log p(X \mid Y) + \log p(Y)$

Represent X as a D-dimensional vector (of features): $X = (X_1, X_2, X_3, \ldots, X_D)$

SLIDE 22

Naïve Bayes

$\operatorname{argmax}_Y\, p(Y \mid X)$; apply Bayes rule and take logs: $\operatorname{argmax}_Y\, \log p(X \mid Y) + \log p(Y)$

$\operatorname{argmax}_Y\, \sum_{j=1}^{D} \log p(X_j \mid Y) + \log p(Y)$

Naively generate each “feature” of X, conditioned on Y
SLIDE 23

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

SLIDE 24

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

SLIDE 25

The Bag of Words Representation


Adapted from Jurafsky & Martin (draft)

SLIDE 26

Bag of Words Representation

γ( [document] ) = c

seen        2
sweet       1
whimsical   1
recommend   1
happy       1
...         ...

[Figure: documents fed to a classifier]

Adapted from Jurafsky & Martin (draft)

SLIDE 27

Naïve Bayes: A Generative Story

Generative Story (global parameters):

$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$:
    $\theta_k$ = generate parameters

$p(x_{ij} \mid y = k)\; p(y = k)$

$\sum_{j=1}^{D} \log p(X_{ij} \mid Y_i) + \log p(Y_i)$

SLIDE 28

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels (global parameters)
for label $k = 1$ to $K$:
    $\theta_k$ = generate parameters
for item $i = 1$ to $N$:
    $y_i \sim \mathrm{Cat}(\phi)$   ← choose the label

[Figure: label node $y$]

$\sum_{j=1}^{D} \log p(X_{ij} \mid Y_i) + \log p(Y_i)$

SLIDE 29

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels (global parameters)
for label $k = 1$ to $K$:
    $\theta_k$ = generate parameters
for item $i = 1$ to $N$:
    $y_i \sim \mathrm{Cat}(\phi)$
    for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$   ← local variables; generate each feature based on the label

[Figure: label node $y$ with feature children $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$]

$\sum_{j=1}^{D} \log p(X_{ij} \mid Y_i) + \log p(Y_i)$

SLIDE 30

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$:
    $\theta_k$ = generate parameters
for item $i = 1$ to $N$:
    $y_i \sim \mathrm{Cat}(\phi)$
    for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

each $x_{ij}$ is conditionally independent of the others (given the label)

[Figure: label node $y$ with feature children $x_{i1}, \ldots, x_{i5}$]

$\sum_{j=1}^{D} \log p(X_{ij} \mid Y_i) + \log p(Y_i)$

SLIDE 31

Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$:
    $\theta_k$ = generate parameters
for item $i = 1$ to $N$:
    $y_i \sim \mathrm{Cat}(\phi)$
    for each feature $j$: $x_{ij} \sim F_j(\theta_{y_i})$

[Figure: label node $y$ with feature children $x_{i1}, \ldots, x_{i5}$]

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \sum_j \log F_{y_i}(x_{ij}; \theta_{y_i}) + \sum_i \log \phi_{y_i}$

s.t. $\sum_k \phi_k = 1$; $\phi_k \geq 0$; each $\theta_k$ is valid for $F_j$

SLIDE 32

Multinomial Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$:
    $\theta_k$ = distribution over $J$ feature values
for item $i = 1$ to $N$:
    $y_i \sim \mathrm{Cat}(\phi)$
    for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$

[Figure: label node $y$ with feature children $x_{i1}, \ldots, x_{i5}$]

Maximize Log-likelihood:

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i,\, x_{ij}} + \sum_i \log \phi_{y_i}$

s.t. $\sum_k \phi_k = 1$; $\sum_j \theta_{kj} = 1\ \forall k$; $\theta_{kj} \geq 0$, $\phi_k \geq 0$

SLIDE 33

Multinomial Naïve Bayes: A Generative Story

Generative Story:

$\phi$ = distribution over $K$ labels
for label $k = 1$ to $K$:
    $\theta_k$ = distribution over $J$ feature values
for item $i = 1$ to $N$:
    $y_i \sim \mathrm{Cat}(\phi)$
    for each feature $j$: $x_{ij} \sim \mathrm{Cat}(\theta_{y_i})$

[Figure: label node $y$ with feature children $x_{i1}, \ldots, x_{i5}$]

Maximize Log-likelihood via Lagrange multipliers (the $\geq 0$ constraints are not shown):

$\mathcal{L}(\theta) = \sum_i \sum_j \log \theta_{y_i,\, x_{ij}} + \sum_i \log \phi_{y_i} - \mu \Big( \sum_k \phi_k - 1 \Big) - \sum_k \lambda_k \Big( \sum_j \theta_{kj} - 1 \Big)$

SLIDE 34

Multinomial Naïve Bayes: Learning

Calculate class priors:
For each k:
    items_k = all items with class = k
    $p(k) = \frac{|\text{items}_k|}{\#\text{items}}$

Calculate feature generation terms:
For each k:
    obs_k = single object containing all items labeled as k
    For each feature j:
        n_kj = # of occurrences of j in obs_k
        $p(j \mid k) = \frac{n_{kj}}{\sum_{j'} n_{kj'}}$

SLIDE 35

Brill and Banko (2001): with enough data, the classifier may not matter

Adapted from Jurafsky & Martin (draft)

SLIDE 36

Summary: Naïve Bayes is Not So Naïve, but not without issue

Pro:
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- Dependable baseline for text classification (but often not the best)

Con:
- Model the posterior in one go? (e.g., use conditional maxent)
- Are the features really uncorrelated?
- Are plain counts always appropriate?
- Are there “better” ways of handling missing/noisy data? (automated, more principled)

Adapted from Jurafsky & Martin (draft)

SLIDE 37

Outline

Directed Graphical Models
- Naïve Bayes
Undirected Graphical Models
- Factor Graphs
- Ising Model
Message Passing: Graphical Model Inference

SLIDE 38

Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables $X_1, \ldots, X_N$. The joint probability factorizes based on cliques in the graph.

SLIDE 39

Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables $X_1, \ldots, X_N$. The joint probability factorizes based on cliques in the graph. Common name: Markov Random Fields

SLIDE 40

Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables $X_1, \ldots, X_N$. The joint probability factorizes based on cliques in the graph. Common name: Markov Random Fields. Undirected graphs can have an alternative formulation as Factor Graphs

SLIDE 41

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \ldots, x_N)$

SLIDE 42

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \ldots, x_N)$

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 43

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

- $x_C$: the variables that are part of the clique C
- the product runs over maximal cliques
- $Z$: global normalization
- $\psi_C$: potential function (not necessarily a probability!)

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 44

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

- $x_C$: the variables that are part of the clique C
- the product runs over maximal cliques
- $Z$: global normalization
- $\psi_C$: potential function (not necessarily a probability!)

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 45

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

- $x_C$: the variables that are part of the clique C
- the product runs over maximal cliques
- $Z$: global normalization
- $\psi_C$: potential function (not necessarily a probability!)

Q: What restrictions should we place on the potentials $\psi_C$?

SLIDE 46

Markov Random Fields: Undirected Graphs

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

- $x_C$: the variables that are part of the clique C
- the product runs over maximal cliques
- $Z$: global normalization
- $\psi_C$: potential function (not necessarily a probability!)

Q: What restrictions should we place on the potentials $\psi_C$?
A: $\psi_C \geq 0$ (or $\psi_C > 0$)

SLIDE 47

Terminology: Potential Functions

$\psi_C(x_C) = \exp(-E(x_C))$

$E$: energy function (for clique C); this form gives a Boltzmann distribution

(get the total energy of a configuration by summing the individual energy functions)

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 48

Ambiguity in Undirected Model Notation

[Figure: triangle over X, Y, Z]

The same graph could mean $p(x, y, z) \propto \psi(x, y, z)$ or $p(x, y, z) \propto \psi_1(x, y)\, \psi_2(y, z)\, \psi_3(x, z)$

SLIDE 49

Outline

Directed Graphical Models
- Naïve Bayes
Undirected Graphical Models
- Factor Graphs
- Ising Model
Message Passing: Graphical Model Inference

SLIDE 50

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \ldots, X_N)$

Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
- Evidence nodes X are the random variables
- Factor nodes F take values associated with the potential functions
- Edges show what variables are used in which factors

SLIDE 51

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \ldots, X_N)$

Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T

[Figure: triangle MRF over X, Y, Z]

SLIDE 52

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \ldots, X_N)$

Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
- Evidence nodes X are the random variables

[Figure: triangle MRF over X, Y, Z and its factor graph]

SLIDE 53

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \ldots, X_N)$

Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
- Evidence nodes X are the random variables
- Factor nodes F take values associated with the potential functions

[Figure: triangle MRF over X, Y, Z and its factor graph]

SLIDE 54

MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents $p(X_1, \ldots, X_N)$

Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
- Evidence nodes X are the random variables
- Factor nodes F take values associated with the potential functions
- Edges show what variables are used in which factors

[Figure: triangle MRF over X, Y, Z and its factor graph]

SLIDE 55

Different Factor Graph Notation for the Same Graph

[Figure: three different factor graphs over X, Y, Z that represent the same undirected graph]

SLIDE 56

Directed vs. Undirected Models: Moralization

[Figure: DAG with parents x1, x2, x3 and child x4]

SLIDE 57

Directed vs. Undirected Models: Moralization

[Figure: DAG with parents x1, x2, x3 and child x4]

$p(x_1, \ldots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$

[Figure: the corresponding undirected graph over x1–x4]

SLIDE 58

Directed vs. Undirected Models: Moralization

[Figure: DAG with parents x1, x2, x3 and child x4]

$p(x_1, \ldots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$

[Figure: the corresponding undirected graph over x1–x4]

Parents of nodes in a directed graph must be connected in the undirected graph

SLIDE 59

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)

[Figure: chain z1 → z2 → z3 → z4, with each zi → wi]

SLIDE 60

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)

[Figure: chain z1 → z2 → z3 → z4, with each zi → wi]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional)

[Figure: chain z1 → z2 → z3 → z4, with each wi → zi]

SLIDE 61

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)

[Figure: chain z1 → z2 → z3 → z4, with each zi → wi]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional)

[Figure: chain z1 → z2 → z3 → z4, with each wi → zi]

Undirected (e.g., conditional random field [CRF])

[Figure: undirected chain z1 – z2 – z3 – z4, each zi linked to wi]

SLIDE 62

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)

[Figure: chain z1 → z2 → z3 → z4, with each zi → wi]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional)

[Figure: chain z1 → z2 → z3 → z4, with each wi → zi]

Undirected as factor graph (e.g., conditional random field [CRF])

[Figure: factor graph over z1 z2 z3 z4]

SLIDE 63

Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging

[Figure: chain CRF over tags z1 z2 z3 z4]

President   Obama   told   Congress   …
Noun-Mod    Noun    Verb   Noun

SLIDE 64

Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging and named entity recognition

[Figure: chain CRF over tags z1 z2 z3 z4]

President   Obama   told   Congress   …
Noun-Mod    Noun    Verb   Noun

President   Obama   told   Congress   …
Person      Person  Other  Org.

SLIDE 65

Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \ldots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 66

Linear Chain CRFs for Part of Speech Tagging

$p(\clubsuit \mid \diamondsuit)$

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \ldots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 67

Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \ldots, z_N \mid \diamondsuit)$

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \ldots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 68

Linear Chain CRFs for Part of Speech Tagging

$p(z_1, z_2, \ldots, z_N \mid x_{1:N})$

A linear chain CRF is a conditional probabilistic model of the sequence of tags $z_1, z_2, \ldots, z_N$ conditioned on the entire input sequence $x_{1:N}$

SLIDE 69

Linear Chain CRFs for Part of Speech Tagging

[Figure: factor graph over tags $z_1, \ldots, z_4$ with unary factors $f_1, \ldots, f_4$ and pairwise factors $g_1, \ldots, g_4$]

$p(z_1, z_2, \ldots, z_N \mid x_{1:N})$

SLIDE 70

Linear Chain CRFs for Part of Speech Tagging

[Figure: factor graph over tags $z_1, \ldots, z_4$ with unary factors $f_1, \ldots, f_4$ and pairwise factors $g_1, \ldots, g_4$]

$p(z_1, z_2, \ldots, z_N \mid x_{1:N}) \propto \prod_{i=1}^{N} \exp\big(\theta_f^\top f_i(z_i) + \theta_g^\top g_i(z_i, z_{i+1})\big)$

SLIDE 71

Linear Chain CRFs for Part of Speech Tagging

[Figure: factor graph over tags $z_1, \ldots, z_4$ with unary factors $f_1, \ldots, f_4$ and pairwise factors $g_1, \ldots, g_4$]

$g_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)

SLIDE 72

Linear Chain CRFs for Part of Speech Tagging

[Figure: factor graph over tags $z_1, \ldots, z_4$ with unary factors $f_1, \ldots, f_4$ and pairwise factors $g_1, \ldots, g_4$]

$g_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$f_i$: solo tag features (can depend on any/all input words $x_{1:N}$)

SLIDE 73

Linear Chain CRFs for Part of Speech Tagging

[Figure: factor graph over tags $z_1, \ldots, z_4$ with unary factors $f_1, \ldots, f_4$ and pairwise factors $g_1, \ldots, g_4$]

$g_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$f_i$: solo tag features (can depend on any/all input words $x_{1:N}$)

Feature design, just like in maxent models!

SLIDE 74

Linear Chain CRFs for Part of Speech Tagging

[Figure: factor graph over tags $z_1, \ldots, z_4$ with unary factors $f_1, \ldots, f_4$ and pairwise factors $g_1, \ldots, g_4$]

$g_i$: inter-tag features (can depend on any/all input words $x_{1:N}$)
$f_i$: solo tag features (can depend on any/all input words $x_{1:N}$)

Example:
$g_{N \to V}(z_i, z_{i+1}) = 1$ if $z_i = \mathrm{N}$ and $z_{i+1} = \mathrm{V}$, else $0$
$g_{\text{told},\, N \to V}(z_i, z_{i+1}) = 1$ if $z_i = \mathrm{N}$, $z_{i+1} = \mathrm{V}$, and $x_i = \text{told}$, else $0$
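
These indicator features translate directly into code. A sketch with hypothetical function names (a real tagger would use thousands of such features, learned weights $\theta$, and dynamic programming for inference):

```python
# Two example CRF indicator features (names are ours, not from the slides).
def g_noun_to_verb(z_i, z_next, x, i):
    """Inter-tag indicator: fires when a Noun is followed by a Verb."""
    return 1.0 if z_i == "N" and z_next == "V" else 0.0

def g_told_noun_to_verb(z_i, z_next, x, i):
    """Same transition, but only when the current word is 'told'."""
    return 1.0 if z_i == "N" and z_next == "V" and x[i] == "told" else 0.0

x = ["President", "Obama", "told", "Congress"]
z = ["N", "N", "V", "N"]
score = sum(g_noun_to_verb(z[i], z[i + 1], x, i) for i in range(len(z) - 1))
print(score)  # 1.0: the N->V transition fires once (at Obama -> told)
```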

SLIDE 75

Outline

Directed Graphical Models
- Naïve Bayes
Undirected Graphical Models
- Factor Graphs
- Ising Model
Message Passing: Graphical Model Inference

SLIDE 76

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise

[Figure: grid MRF with latent pixels X coupled to observed pixels Y]

SLIDE 77

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise; two solutions

Q: What are the cliques?

SLIDE 78

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise; two solutions

Q: What are the cliques?

SLIDE 79

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise; two solutions

$E(x, y) = h \sum_i x_i - \beta \sum_{ij} x_i x_j - \eta \sum_i x_i y_i$

- $h \sum_i x_i$: allow for a bias
- $\beta \sum_{ij} x_i x_j$: neighboring pixels should be similar
- $\eta \sum_i x_i y_i$: $x_i$ and $y_i$ should be correlated

SLIDE 80

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise; two solutions

$E(x, y) = h \sum_i x_i - \beta \sum_{ij} x_i x_j - \eta \sum_i x_i y_i$

- $h \sum_i x_i$: allow for a bias
- $\beta \sum_{ij} x_i x_j$: neighboring pixels should be similar
- $\eta \sum_i x_i y_i$: $x_i$ and $y_i$ should be correlated

SLIDE 81

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise; two solutions

$E(x, y) = h \sum_i x_i - \beta \sum_{ij} x_i x_j - \eta \sum_i x_i y_i$

- $h \sum_i x_i$: allow for a bias
- $\beta \sum_{ij} x_i x_j$: neighboring pixels should be similar
- $\eta \sum_i x_i y_i$: $x_i$ and $y_i$ should be correlated

Q: Why subtract β and η?

SLIDE 82

Example: Ising Model

x: original pixel/state
y: observed (noisy) pixel/state

Image denoising (Bishop, 2006; Fig 8.30): original image; with 10% noise; two solutions

$E(x, y) = h \sum_i x_i - \beta \sum_{ij} x_i x_j - \eta \sum_i x_i y_i$

- $h \sum_i x_i$: allow for a bias
- $\beta \sum_{ij} x_i x_j$: neighboring pixels should be similar
- $\eta \sum_i x_i y_i$: $x_i$ and $y_i$ should be correlated

Q: Why subtract β and η?
A: Better states → lower energy (higher potential): $\psi_C(x_C) = \exp(-E(x_C))$
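
A sketch of how this energy can be used for denoising via greedy coordinate descent (ICM); the grid, noise level, and coefficient values below are made up, and a real implementation would recompute only the local energy change per flip:

```python
# Greedy coordinate-wise denoising (ICM) under the Ising energy above.
import numpy as np

rng = np.random.default_rng(0)
x_true = np.ones((20, 20)); x_true[5:15, 5:15] = -1              # +/-1 image
y = np.where(rng.random(x_true.shape) < 0.1, -x_true, x_true)   # 10% flips

def energy(x, y, h=0.0, beta=1.0, eta=2.0):
    """E(x, y) = h sum x_i - beta sum x_i x_j - eta sum x_i y_i."""
    pair = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    return h * x.sum() - beta * pair - eta * np.sum(x * y)

x = y.copy()
for _ in range(5):                        # a few sweeps over all pixels
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = 1
            e_plus = energy(x, y)
            x[i, j] = -1
            if e_plus < energy(x, y):     # keep the lower-energy state
                x[i, j] = 1
print("pixels wrong:", int(np.sum(x != x_true)))  # typically most noise removed
```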

SLIDE 83

Markov Random Fields with Factor Graph Notation

x: original pixel/state
y: observed (noisy) pixel/state

Factor nodes are added according to maximal cliques: a unary factor per variable and a binary factor per neighboring pair. Factor graphs are bipartite.

[Figure: grid factor graph showing variable nodes, unary factors, and binary factors]

SLIDE 84

Outline

Directed Graphical Models
- Naïve Bayes
Undirected Graphical Models
- Factor Graphs
- Ising Model
Message Passing: Graphical Model Inference

SLIDE 85

Two Problems for Undirected Models

Finding the normalizer
Computing the marginals

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 86

Two Problems for Undirected Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 87

Two Problems for Undirected Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals: $Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$

Sum over all variable combinations, with the $x_n$ coordinate fixed. Example: 3 variables, fix the 2nd dimension:

$Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c(x = (x_1, v, x_3))$

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

SLIDE 88

Two Problems for Undirected Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals: $Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$

Sum over all variable combinations, with the $x_n$ coordinate fixed. Example: 3 variables, fix the 2nd dimension:

$Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c(x = (x_1, v, x_3))$

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

Q: Why are these difficult?

SLIDE 89

Two Problems for Undirected Models

Finding the normalizer: $Z = \sum_x \prod_c \psi_c(x_c)$

Computing the marginals: $Z_n(v) = \sum_{x:\, x_n = v} \prod_c \psi_c(x_c)$

Sum over all variable combinations, with the $x_n$ coordinate fixed. Example: 3 variables, fix the 2nd dimension:

$Z_2(v) = \sum_{x_1} \sum_{x_3} \prod_c \psi_c(x = (x_1, v, x_3))$

$p(x_1, x_2, x_3, \ldots, x_N) = \frac{1}{Z} \prod_C \psi_C(x_C)$

Q: Why are these difficult?
A: Many different combinations (exponentially many joint assignments)

SLIDE 90

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.
If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16

SLIDE 91

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.
If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16

SLIDE 92

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.
If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16

SLIDE 93

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you.
If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16
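
The counting rule is exactly a pair of message-passing sweeps; a sketch with a made-up line of 7 soldiers:

```python
# Each soldier adds one to the number heard and passes it on; the line's
# total is (message from the front) + (message from the back) + 1.
n = 7
fwd = [0] * n                 # fwd[i]: number said to soldier i from the front
bwd = [0] * n                 # bwd[i]: number said to soldier i from the back
for i in range(1, n):
    fwd[i] = fwd[i - 1] + 1
for i in range(n - 2, -1, -1):
    bwd[i] = bwd[i + 1] + 1
totals = [fwd[i] + bwd[i] + 1 for i in range(n)]
print(totals)                 # every soldier computes 7 from local messages only
```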

SLIDE 94

Sum-Product Algorithm

Main idea: message passing
An exact inference algorithm for tree-like graphs
Belief propagation (forward-backward for HMMs) is a special case

SLIDE 95

Sum-Product

$p(x_n = v) = \sum_{x:\, x_n = v} p(x_1, x_2, \ldots, x_n, \ldots, x_N)$

(definition of marginal)

[Figure: a chain-structured factor graph]

SLIDE 96

Sum-Product

$p(x_n = v) = \sum_{x:\, x_n = v} p(x_1, x_2, \ldots, x_n, \ldots, x_N)$

(definition of marginal)

[Figure: a chain-structured factor graph]

Main idea: use the bipartite nature of the graph to efficiently compute the marginals — the factor nodes can act as filters

SLIDE 97

Sum-Product

$p(x_n = v) = \sum_{x:\, x_n = v} p(x_1, x_2, \ldots, x_n, \ldots, x_N)$

(definition of marginal)

[Figure: messages $r_{m \to n}$ arriving at a variable from its neighboring factors]

Main idea: use the bipartite nature of the graph to efficiently compute the marginals

SLIDE 98

Sum-Product

$p(x_n = v) = \prod_{m} r_{m \to n}(v)$

(alternative marginal computation)

[Figure: messages $r_{m \to n}$ arriving at a variable from its neighboring factors]

Main idea: use the bipartite nature of the graph to efficiently compute the marginals

SLIDE 99

Sum-Product

From variables to factors:

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$

$M(n)$: the set of factors in which variable $n$ participates
(default value of 1 if empty product)

[Figure: variable n sending a message to factor m]

SLIDE 100

Sum-Product

From variables to factors:

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$

From factors to variables:

$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

$M(n)$: the set of factors in which variable $n$ participates
$N(m)$: the set of variables that the m-th factor depends on
The sum runs over configurations of the variables in the m-th factor, with variable $n$ fixed
(default value of 1 if empty product)
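
A direct transcription of the two update equations for binary variables, with messages stored as dicts from value to score; the factor table and the tiny demo at the end are made up:

```python
# Sum-product message updates for binary variables.
import itertools
import math

f_a = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # made-up table

def q_var_to_factor(other_incoming):
    """q_{n->m}: product of r messages from n's other factors (1 if none)."""
    return {v: math.prod(r[v] for r in other_incoming) for v in (0, 1)}

def r_factor_to_var(table, q_from_others, target_index, arity=2):
    """r_{m->n}: sum the factor over all variables but n, weighting each
    configuration by the q messages from the factor's other variables."""
    out = {0: 0.0, 1: 0.0}
    for config in itertools.product((0, 1), repeat=arity):
        weight = table[config]
        for j, q in q_from_others.items():   # j: position of the variable
            weight *= q[config[j]]
        out[config[target_index]] += weight
    return out

q1 = q_var_to_factor([])                       # leaf variable: q = 1
r_to_x2 = r_factor_to_var(f_a, {0: q1}, target_index=1)
print(r_to_x2)                                 # {0: 3.0, 1: 3.0}
```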

SLIDE 101

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a, f_b, f_c$; $x_2$ is shared by all three factors]

Q: What are the variables?

SLIDE 102

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a, f_b, f_c$]

Q: What are the variables?
A: $x_1, x_2, x_3, x_4$

Q: What are the factors?

SLIDE 103

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a, f_b, f_c$]

Q: What are the variables?
A: $x_1, x_2, x_3, x_4$

Q: What are the factors?
A: $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$

SLIDE 104

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

Q: What is the distribution we're modeling?

SLIDE 105

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

Q: What is the distribution we're modeling?

A: $p(x_1, x_2, x_3, x_4) = f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4)$

SLIDE 106

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root

$q_{x_1 \to f_a}(x_1) = 1$
$q_{x_4 \to f_c}(x_4) = 1$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 107

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root

$q_{x_1 \to f_a}(x_1) = 1$
$q_{x_4 \to f_c}(x_4) = 1$
$r_{f_a \to x_2}(x_2) = ???$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 108

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root

$q_{x_1 \to f_a}(x_1) = 1$
$q_{x_4 \to f_c}(x_4) = 1$
$r_{f_a \to x_2}(x_2) = \sum_k f_a(x_1 = k,\, x_2)$
$r_{f_c \to x_2}(x_2) = \sum_k f_c(x_2,\, x_4 = k)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 109

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root

$q_{x_1 \to f_a}(x_1) = 1$
$q_{x_4 \to f_c}(x_4) = 1$
$r_{f_a \to x_2}(x_2) = \sum_k f_a(x_1 = k,\, x_2)$
$r_{f_c \to x_2}(x_2) = \sum_k f_c(x_2,\, x_4 = k)$
$q_{x_2 \to f_b}(x_2) = ???$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 110

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root

$q_{x_1 \to f_a}(x_1) = 1$
$q_{x_4 \to f_c}(x_4) = 1$
$r_{f_a \to x_2}(x_2) = \sum_k f_a(x_1 = k,\, x_2)$
$r_{f_c \to x_2}(x_2) = \sum_k f_c(x_2,\, x_4 = k)$
$q_{x_2 \to f_b}(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 111

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root

$q_{x_1 \to f_a}(x_1) = 1$
$q_{x_4 \to f_c}(x_4) = 1$
$r_{f_a \to x_2}(x_2) = \sum_k f_a(x_1 = k,\, x_2)$
$r_{f_c \to x_2}(x_2) = \sum_k f_c(x_2,\, x_4 = k)$
$q_{x_2 \to f_b}(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$
$r_{f_b \to x_3}(x_3) = \sum_k f_b(x_2 = k,\, x_3)\; q_{x_2 \to f_b}(k)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 112

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 113

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 114

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$
$q_{x_2 \to f_a}(x_2) = ???$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 115

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$
$q_{x_2 \to f_a}(x_2) = r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$

We just computed $r_{f_b \to x_2}$. Q: Where did we compute $r_{f_c \to x_2}$?

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 116

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$
$q_{x_2 \to f_a}(x_2) = r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$

We just computed $r_{f_b \to x_2}$. Q: Where did we compute $r_{f_c \to x_2}$?
A: In step 1 (leaves → root)

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 117

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$
$q_{x_2 \to f_a}(x_2) = r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$
$q_{x_2 \to f_c}(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_b \to x_2}(x_2)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 118

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$
$q_{x_2 \to f_a}(x_2) = r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$
$q_{x_2 \to f_c}(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_b \to x_2}(x_2)$
$r_{f_c \to x_4}(x_4) = \sum_k f_c(x_2 = k,\, x_4)\; q_{x_2 \to f_c}(k)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 119

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves

$q_{x_3 \to f_b}(x_3) = 1$
$r_{f_b \to x_2}(x_2) = \sum_k f_b(x_2,\, x_3 = k)$
$q_{x_2 \to f_a}(x_2) = r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$
$q_{x_2 \to f_c}(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_b \to x_2}(x_2)$
$r_{f_c \to x_4}(x_4) = \sum_k f_c(x_2 = k,\, x_4)\; q_{x_2 \to f_c}(k)$
$r_{f_a \to x_1}(x_1) = \sum_k f_a(x_1,\, x_2 = k)\; q_{x_2 \to f_a}(k)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 120

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities

$p(x_n) = \prod_{m \in M(n)} r_{m \to n}(x_n)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 121

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities

$p(x_n) = \prod_{m \in M(n)} r_{m \to n}(x_n)$

$p(x_1) = r_{f_a \to x_1}(x_1)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 122

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities

$p(x_n) = \prod_{m \in M(n)} r_{m \to n}(x_n)$

$p(x_1) = r_{f_a \to x_1}(x_1)$
$p(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 123

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities

$p(x_n) = \prod_{m \in M(n)} r_{m \to n}(x_n)$

$p(x_1) = r_{f_a \to x_1}(x_1)$
$p(x_2) = r_{f_a \to x_2}(x_2)\; r_{f_b \to x_2}(x_2)\; r_{f_c \to x_2}(x_2)$
$p(x_3) = r_{f_b \to x_3}(x_3)$
$p(x_4) = r_{f_c \to x_4}(x_4)$

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 124

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities
4. Are we done?
   - If a tree structure, we've converged

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 125

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree ($x_3$)
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities
4. Are we done?
   - If a tree structure, we've converged
   - If not: either accept the partially converged result, or…

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 126

Example

[Figure: factor graph with variables $x_1, x_2, x_3, x_4$ and factors $f_a(x_1, x_2)$, $f_b(x_2, x_3)$, $f_c(x_2, x_4)$]

0. Select the root, or pick one if a tree
1. Send messages from leaves to root
2. Send messages from root to leaves
3. Use messages to compute marginal probabilities
4. Are we done?
   - If a tree structure, we've converged
   - If not: either accept the partially converged result, or go back to (1) and repeat [Loopy BP]

$q_{n \to m}(x_n) = \prod_{m' \in M(n) \setminus m} r_{m' \to n}(x_n)$
$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

SLIDE 127

Max-Product (Max-Sum)

Problem: how to find the most likely (best) setting of the latent variables

Replace sum (+) with max in the factor→variable computations:

$r_{m \to n}(x_n) = \max_{\mathbf{x}_m \setminus n} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus n} q_{n' \to m}(x_{n'})$

(why “max-sum”? computationally, implement with logs)
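
The change to the earlier factor-to-variable sketch is a one-line swap of the reduction; in log space the products become sums, hence “max-sum”:

```python
# Max-product variant of the earlier r_factor_to_var sketch: identical
# except the sum over configurations becomes a max.
import itertools

def r_factor_to_var_max(table, q_from_others, target_index, arity=2):
    out = {0: 0.0, 1: 0.0}
    for config in itertools.product((0, 1), repeat=arity):
        weight = table[config]
        for j, q in q_from_others.items():
            weight *= q[config[j]]
        v = config[target_index]
        out[v] = max(out[v], weight)        # max instead of sum
    return out

f = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # made-up table
print(r_factor_to_var_max(f, {}, target_index=1))          # {0: 2.0, 1: 4.0}
```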

SLIDE 128

Loopy Belief Propagation

The sum-product algorithm is not exact for general graphs.
Loopy Belief Propagation (Loopy BP): run the sum-product algorithm anyway and hope for the best.
Requires a message passing schedule.

slide-129
SLIDE 129

Outline

Directed Graphical Models NaΓ―ve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference