SLIDE 1

Undirected Probabilistic Graphical Models

CMSC 678 UMBC

SLIDE 2

Announcement 1: Progress Report on Project

Due Monday April 16th, 11:59 AM
Build on the proposal:
  • Update to address comments
  • Discuss the progress you’ve made
  • Discuss what remains to be done
  • Discuss any new blocks you’ve experienced (or anticipate experiencing)
Any questions?

SLIDE 3

Announcement 2: Assignment 4

Due Monday May 14th, 11:59 AM
Topic: probabilistic & graphical modeling

SLIDE 4

Recap from last time…

SLIDE 5

Hidden Markov Model Representation

$$p(z_1, x_1, z_2, x_2, \ldots, z_N, x_N) = p(z_1 \mid z_0)\, p(x_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(x_N \mid z_N) = \prod_j p(x_j \mid z_j)\, p(z_j \mid z_{j-1})$$

$p(x_j \mid z_j)$: emission probabilities/parameters; $p(z_j \mid z_{j-1})$: transition probabilities/parameters

[Figure: the HMM drawn as a graph — hidden-state chain z1 → z2 → z3 → z4, with each z_i emitting an observation w_i]

represent the probabilities and independence assumptions in a graph

SLIDE 6

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[*][*] = 0
v[0][START] = 1
for(i = 1; i ≤ N+1; ++i) {
    for(state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for(old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            if(v[i-1][old] * pobs * pmove > v[i][state]) {
                v[i][state] = v[i-1][old] * pobs * pmove
                b[i][state] = old
            }
        }
    }
}

b stores backpointers (book-keeping). Computing v at time i−1 will correctly incorporate (maximize over) paths through time i−2: we correctly obey the Markov property. v(i, s) is the maximum probability of any path to state s from the beginning (and emitting the observations).
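As a concrete companion to the pseudocode, here is a minimal NumPy sketch (an illustration, not the course's reference code; the array conventions are assumptions): start[s] = p(z_1 = s), trans[s, s'] = p(s' | s), emit[s, x] = p(x | s), with integer-coded observations.

    import numpy as np

    def viterbi(obs, start, trans, emit):
        # v[i, s]: max probability of any path ending in state s at step i
        # b[i, s]: backpointer to the best previous state
        N, K = len(obs), len(start)
        v = np.zeros((N, K))
        b = np.zeros((N, K), dtype=int)
        v[0] = start * emit[:, obs[0]]
        for i in range(1, N):
            # scores[old, new] = v[i-1, old] * p(new | old) * p(obs_i | new)
            scores = v[i - 1][:, None] * trans * emit[:, obs[i]][None, :]
            b[i] = scores.argmax(axis=0)
            v[i] = scores.max(axis=0)
        # follow backpointers from the best final state
        path = [int(v[-1].argmax())]
        for i in range(N - 1, 0, -1):
            path.append(int(b[i, path[-1]]))
        return path[::-1]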
SLIDE 7

Marginal Probability (via the Forward Algorithm)

α(i, s) is the total probability of all paths:
  1. that start from the beginning
  2. that end (currently) in s at step i
  3. that emit the observation obs at i

$$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$$

how likely is it to get into state s this way? what are the immediate ways to get into state s? what’s the total probability up until now?

Q: What do we return? (How do we return the likelihood of the sequence?)

A: α[N+1][END]

There’s an analogous backwards algorithm
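Under the same assumed conventions as the Viterbi sketch above, the forward recurrence is essentially a one-line change: replace max with sum. A minimal sketch:

    import numpy as np

    def forward(obs, start, trans, emit):
        # alpha[i, s]: total probability of all paths ending in s at step i
        N, K = len(obs), len(start)
        alpha = np.zeros((N, K))
        alpha[0] = start * emit[:, obs[0]]
        for i in range(1, N):
            # alpha[i, s] = sum_{s'} alpha[i-1, s'] * p(s | s') * p(obs_i | s)
            alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
        return alpha, alpha[-1].sum()  # likelihood of the whole sequence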

SLIDE 8

With Both Forward and Backward Values

α(i, s) · p(s' | s) · p(obs at i+1 | s') · β(i+1, s') = total probability of paths through the s→s' arc (at time i)
α(i, s) · β(i, s) = total probability of paths through state s at step i

$$p(z_j = t \mid x_1, \ldots, x_N) = \frac{\alpha(j, t)\, \beta(j, t)}{\alpha(N+1, \text{END})}$$

$$p(z_j = t, z_{j+1} = t' \mid x_1, \ldots, x_N) = \frac{\alpha(j, t)\, p(t' \mid t)\, p(\text{obs}_{j+1} \mid t')\, \beta(j+1, t')}{\alpha(N+1, \text{END})}$$

SLIDE 9

EM For HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for(i = N; i ≥ 0; --i) {
    for(next = 0; next < K*; ++next) {
        cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
        for(state = 0; state < K*; ++state) {
            u = pobs(obs_{i+1} | next) * ptrans(next | state)
            ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
        }
    }
}
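A self-contained sketch of this E-step in NumPy (illustrative; same assumed start/trans/emit conventions as the earlier sketches, with indices 0..N−1 instead of explicit START/END states):

    import numpy as np

    def baum_welch_counts(obs, start, trans, emit):
        # forward and backward passes
        N, K = len(obs), len(start)
        alpha = np.zeros((N, K)); beta = np.zeros((N, K))
        alpha[0] = start * emit[:, obs[0]]
        for i in range(1, N):
            alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
        beta[-1] = 1.0
        for i in range(N - 2, -1, -1):
            beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])
        L = alpha[-1].sum()  # sequence likelihood

        # expected (soft) counts, each normalized by L
        c_emit = np.zeros_like(emit)
        c_trans = np.zeros_like(trans)
        for i in range(N):
            c_emit[:, obs[i]] += alpha[i] * beta[i] / L  # state marginals
            if i + 1 < N:
                # alpha[i, s] * p(s'|s) * p(obs_{i+1}|s') * beta[i+1, s'] / L
                c_trans += (alpha[i][:, None] * trans *
                            (emit[:, obs[i + 1]] * beta[i + 1])[None, :]) / L
        return c_trans, c_emit  # M-step: renormalize these into new parameters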

SLIDE 10

Bayesian Networks: Directed Acyclic Graphs

$$p(y_1, y_2, y_3, \ldots, y_N) = \prod_j p(y_j \mid \pi(y_j))$$

π(·) means “parents of”; evaluate the product in topological-sort order.
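For instance, a toy three-variable DAG A → C ← B with made-up binary tables (all numbers purely illustrative):

    # p(a, b, c) = p(a) p(b) p(c | a, b), following a topological order
    p_a = {0: 0.6, 1: 0.4}
    p_b = {0: 0.7, 1: 0.3}
    p_c1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(c=1 | a, b)

    def joint(a, b, c):
        pc = p_c1[(a, b)] if c == 1 else 1 - p_c1[(a, b)]
        return p_a[a] * p_b[b] * pc

    # sanity check: the joint sums to 1 over all 8 configurations
    print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))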

SLIDE 11

Bayesian Networks: Directed Acyclic Graphs

$$p(y_1, y_2, y_3, \ldots, y_N) = \prod_j p(y_j \mid \pi(y_j))$$

Exact inference in general DAGs is NP-hard; inference in trees can be exact.

SLIDE 12

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation

X & Y are d-separated if, for all paths P, one of the following is true:
  • P has a chain with an observed middle node
  • P has a fork with an observed parent node
  • P includes a “v-structure” or “collider” with all unobserved descendants
[Figure: the three path types — chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y]

SLIDE 13

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation

X & Y are d-separated if, for all paths P, one of the following is true:
  • P has a chain with an observed middle node
  • P has a fork with an observed parent node
  • P includes a “v-structure” or “collider” with all unobserved descendants
[Figure: the three path types — chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y]

Chain: observing Z blocks the path from X to Y.
Fork: observing Z blocks the path from X to Y.
Collider: not observing Z blocks the path from X to Y.

SLIDE 14

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z

d-separation

X & Y are d-separated if, for all paths P, one of the following is true:
  • P has a chain with an observed middle node
  • P has a fork with an observed parent node
  • P includes a “v-structure” or “collider” with all unobserved descendants
[Figure: the three path types — chain X → Z → Y, fork X ← Z → Y, collider X → Z ← Y]

Chain: observing Z blocks the path from X to Y.
Fork: observing Z blocks the path from X to Y.
Collider: not observing Z blocks the path from X to Y.

For the collider (v-structure) with unobserved child x:

$$p(y, z, x) = p(y)\, p(z)\, p(x \mid y, z)$$

$$p(y, z) = \sum_x p(y)\, p(z)\, p(x \mid y, z) = p(y)\, p(z)$$
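A quick numeric check of the collider case, reusing the toy A → C ← B tables from the earlier sketch (hypothetical numbers): summing out the unobserved child leaves the parents independent.

    p_a = {0: 0.6, 1: 0.4}
    p_b = {0: 0.7, 1: 0.3}
    p_c1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(c=1 | a, b)

    for a in (0, 1):
        for b in (0, 1):
            marg = sum(p_a[a] * p_b[b] * (p_c1[(a, b)] if c else 1 - p_c1[(a, b)])
                       for c in (0, 1))
            assert abs(marg - p_a[a] * p_b[b]) < 1e-12  # p(a, b) = p(a) p(b)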

SLIDE 15

Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents.

$$p(y_j \mid y_{k \ne j}) = \frac{p(y_1, \ldots, y_N)}{\int p(y_1, \ldots, y_N)\, dy_j} = \frac{\prod_l p(y_l \mid \pi(y_l))}{\int \prod_l p(y_l \mid \pi(y_l))\, dy_j}$$

(the factorization of the graph; factor out terms not dependent on $y_j$)

$$= \frac{\prod_{l : l = j \text{ or } j \in \pi(y_l)} p(y_l \mid \pi(y_l))}{\int \prod_{l : l = j \text{ or } j \in \pi(y_l)} p(y_l \mid \pi(y_l))\, dy_j}$$

The Markov blanket is the set of nodes needed to form the complete conditional for a variable $y_j$.

SLIDE 16

Markov Random Fields: Undirected Graphs

$p(y_1, y_2, y_3, \ldots, y_N)$

SLIDE 17

Markov Random Fields: Undirected Graphs

$p(y_1, y_2, y_3, \ldots, y_N)$

clique: a subset of nodes that are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

SLIDE 18

Markov Random Fields: Undirected Graphs

$$p(y_1, y_2, y_3, \ldots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$$

$y_C$: the variables that are part of the clique C
C ranges over the maximal cliques; 1/Z is a global normalization
$\psi_C$: a potential function (not necessarily a probability!)
clique: a subset of nodes that are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique


SLIDE 20

Markov Random Fields: Undirected Graphs

$$p(y_1, y_2, y_3, \ldots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$$

Q: What restrictions should we place on the potentials $\psi_C$?

SLIDE 21

Markov Random Fields: Undirected Graphs

$$p(y_1, y_2, y_3, \ldots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$$

Q: What restrictions should we place on the potentials $\psi_C$?
A: $\psi_C \ge 0$ (or $\psi_C > 0$)
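To make the definition concrete, a brute-force sketch for a tiny binary MRF (a three-node chain with one pairwise potential per edge; names and numbers are illustrative):

    import itertools
    import numpy as np

    # chain y1 - y2 - y3; psi[a, b] >= 0, here favoring agreement
    psi = np.array([[2.0, 0.5],
                    [0.5, 2.0]])

    def unnorm(y):
        # product of clique potentials for cliques {y1, y2} and {y2, y3}
        return psi[y[0], y[1]] * psi[y[1], y[2]]

    # global normalizer Z: sum over all 2^3 configurations
    Z = sum(unnorm(y) for y in itertools.product((0, 1), repeat=3))
    # p(y) = unnorm(y) / Z is then a proper distribution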

SLIDE 22

Terminology: Potential Functions

$$\psi_C(y_C) = \exp(-E(y_C))$$

E is the energy function (for clique C); a distribution of this form is a Boltzmann distribution.

(get the total energy of a configuration by summing the individual energy functions)

$$p(y_1, y_2, y_3, \ldots, y_N) = \frac{1}{Z} \prod_C \psi_C(y_C)$$

SLIDE 23

Ambiguity in Undirected Model Notation

[Figure: undirected graph — a triangle connecting X, Y, and Z]

$$p(y, z, x) \propto \psi(y, z, x) \quad\text{or}\quad p(y, z, x) \propto \psi_1(y, z)\, \psi_2(z, x)\, \psi_3(y, x)$$

The same undirected graph is consistent with either factorization.

SLIDE 24

Example: Ising Model

x: original pixel/state; y: observed (noisy) pixel/state
[Figure: image denoising (Bishop, 2006; Fig. 8.30) — the original image, a version with 10% noise, and two recovered solutions]

Q: What are the cliques?


SLIDE 26

Example: Ising Model

x: original pixel/state; y: observed (noisy) pixel/state
[Figure: image denoising (Bishop, 2006; Fig. 8.30) — the original image, a version with 10% noise, and two recovered solutions]

$$E(x, y) = h \sum_j x_j - \beta \sum_{\{j,k\}} x_j x_k - \eta \sum_j x_j y_j$$

η term: x_j and y_j should be correlated; β term: neighboring pixels should be similar; h term: allow for a bias.


SLIDE 28

Example: Ising Model


Q: Why subtract β and η?

SLIDE 29

Example: Ising Model


Q: Why subtract β and η?
A: Better states → lower energy → higher potential, since $\psi_C(y_C) = \exp(-E(y_C))$.
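A hedged sketch of one way to use this energy for denoising: greedy coordinate-wise minimization (ICM-style), with pixels in {−1, +1}. The parameter values h = 0, β = 1.0, η = 2.1 follow Bishop's example, but the code is illustrative rather than the lecture's reference implementation.

    import numpy as np

    def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
        # y: observed noisy image with entries in {-1, +1}
        x = y.copy()  # initialize the latent image at the observation
        H, W = x.shape
        for _ in range(sweeps):
            for i in range(H):
                for j in range(W):
                    # sum of the 4-neighborhood of pixel (i, j)
                    nb = sum(x[a, b]
                             for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                             if 0 <= a < H and 0 <= b < W)
                    # x_ij contributes x_ij * (h - beta*nb - eta*y_ij) to E;
                    # pick the sign that lowers the energy
                    x[i, j] = 1 if (-h + beta * nb + eta * y[i, j]) > 0 else -1
        return x

    # usage: flip ~10% of a constant image, then denoise
    rng = np.random.default_rng(0)
    img = np.ones((20, 20), dtype=int)
    noisy = img * np.where(rng.random(img.shape) < 0.1, -1, 1)
    print((icm_denoise(noisy) == img).mean())  # fraction of pixels recovered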

SLIDE 30

Markov Random Fields with Factor Graph Notation

x: original pixel/state y: observed (noisy) pixel/state

factor nodes are added according to maximal cliques
factor graphs are bipartite: variable nodes connect only to factor nodes (and vice versa)
[Figure: the denoising model as a factor graph — a unary factor attached to each variable and a binary factor between each neighboring pair]

SLIDE 31

Different Factor Graph Notation for the Same Graph

[Figure: three different factor graphs over the same variables X, Y, Z]

SLIDE 32

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)
[Figure: chain z1 → z2 → z3 → z4, with each z_i emitting w_i]

SLIDE 33

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)
[Figure: chain z1 → z2 → z3 → z4, with each z_i emitting w_i]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional)
[Figure: chain z1 → z2 → z3 → z4, with each w_i feeding into z_i]

SLIDE 34

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)
[Figure: chain z1 → z2 → z3 → z4, with each z_i emitting w_i]

Undirected (e.g., conditional random field [CRF])
[Figure: undirected chain z1 – z2 – z3 – z4, with each z_i linked to w_i]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional)
[Figure: chain z1 → z2 → z3 → z4, with each w_i feeding into z_i]

SLIDE 35

Example: Linear Chain

Directed (e.g., hidden Markov model [HMM]; generative)
[Figure: chain z1 → z2 → z3 → z4, with each z_i emitting w_i]

Undirected, as a factor graph (e.g., conditional random field [CRF])
[Figure: factor graph over z1 … z4 — pairwise factors between adjacent states and a unary factor per position]

Directed (e.g., maximum entropy Markov model [MEMM]; conditional)
[Figure: chain z1 → z2 → z3 → z4, with each w_i feeding into z_i]

SLIDE 36

Directed vs. Undirected Models: Moralization

[Figure: DAG with x1, x2, x3 each pointing into x4]

SLIDE 37

Directed vs. Undirected Models: Moralization

[Figure: DAG with x1, x2, x3 each pointing into x4]

$$p(x_1, \ldots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$$

[Figure: the moralized undirected graph — x1, x2, x3 pairwise connected, and each connected to x4]

SLIDE 38

Directed vs. Undirected Models: Moralization

[Figure: DAG with x1, x2, x3 each pointing into x4]

$$p(x_1, \ldots, x_4) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)$$

[Figure: the moralized undirected graph] Moralization: parents of a node in a directed graph must be connected in the undirected graph.

SLIDE 39

Two Problems for Undirected Models

  • Finding the normalizer
  • Computing the marginals

SLIDE 40

Two Problems for Undirected Models

Finding the normalizer:

$$Z = \sum_y \prod_C \psi_C(y_C)$$

Computing the marginals

SLIDE 41

Two Problems for Undirected Models

Finding the normalizer:

$$Z = \sum_y \prod_C \psi_C(y_C)$$

Computing the marginals:

$$Z_n(w) = \sum_{y : y_n = w} \prod_C \psi_C(y_C)$$

Sum over all variable combinations, with the nth coordinate ($y_n$) fixed. Example with 3 variables, fixing the 2nd coordinate:

$$Z_2(w) = \sum_{y_1} \sum_{y_3} \prod_C \psi_C(y = (y_1, w, y_3))$$

SLIDE 42

Two Problems for Undirected Models

Q: Why are these difficult?

SLIDE 43

Two Problems for Undirected Models

Q: Why are these difficult?
A: Many different combinations: both sums range over every joint configuration, which grows exponentially with the number of variables.
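Continuing the small chain MRF sketch from above, both quantities can be brute-forced, which makes the difficulty visible: each sum touches every configuration, so the cost is O(K^N) in general.

    import itertools
    import numpy as np

    psi = np.array([[2.0, 0.5],
                    [0.5, 2.0]])  # same illustrative pairwise potential

    def unnorm(y):
        return psi[y[0], y[1]] * psi[y[1], y[2]]

    Z = sum(unnorm(y) for y in itertools.product((0, 1), repeat=3))

    def Z2(w):
        # fix the 2nd coordinate to w, sum over the rest
        return sum(unnorm((y1, w, y3)) for y1 in (0, 1) for y3 in (0, 1))

    print(Z2(0) / Z, Z2(1) / Z)  # marginal p(y2 = w); the two values sum to 1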

SLIDE 44

Sum-Product Algorithm

  • Main idea: message passing
  • An exact inference algorithm for tree-like graphs
  • Belief propagation (forward-backward for HMMs) is a special case

SLIDE 45

Sum-Product

$$p(y_j = w) = \sum_{y : y_j = w} p(y_1, y_2, \ldots, y_j, \ldots, y_N)$$

(definition of marginal)

SLIDE 46

Sum-Product

$$p(y_j = w) = \sum_{y : y_j = w} p(y_1, y_2, \ldots, y_j, \ldots, y_N)$$

(definition of marginal)

main idea: use the bipartite nature of the graph to efficiently compute the marginals

SLIDE 47

Sum-Product

$$p(y_j = w) = \sum_{y : y_j = w} p(y_1, y_2, \ldots, y_j, \ldots, y_N)$$

(definition of marginal)

main idea: use the bipartite nature of the graph to efficiently compute the marginals
[Figure: messages $s_{m \to n}$ passed from neighboring factor nodes into a variable node]

SLIDE 48

Sum-Product

$$p(y_j = w) = \prod_{m} s_{m \to y_j}(w)$$

(alternative marginal computation: a product over the factors m adjacent to $y_j$)

main idea: use the bipartite nature of the graph to efficiently compute the marginals
[Figure: messages $s_{m \to y_j}$ passed from neighboring factor nodes into $y_j$]


SLIDE 50

Sum-Product

From variables to factors:

$$r_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} s_{m' \to n}(y_n)$$

M(n) is the set of factors in which variable n participates; the message takes a default value of 1 if the product is empty.
[Figure: variable node n sending a message to factor node m]

SLIDE 51

Sum-Product

From variables to factors:

$$r_{n \to m}(y_n) = \prod_{m' \in M(n) \setminus m} s_{m' \to n}(y_n)$$

From factors to variables:

$$s_{m \to n}(y_n) = \sum_{\boldsymbol{x}_m \setminus n} g_m(\boldsymbol{x}_m) \prod_{n' \in N(m) \setminus n} r_{n' \to m}(y_{n'})$$

N(m) is the set of variables that the mth factor $g_m$ depends on; M(n) is the set of factors in which variable n participates. The sum ranges over configurations of the mth factor's variables, with variable n held fixed. Messages take a default value of 1 if the product is empty.
[Figure: message flow between variable node n and factor node m]
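On the tiny chain y1 –[f]– y2 –[g]– y3 from the earlier sketches, the messages can be written out directly; this reproduces the brute-force marginal computed above (all numbers illustrative):

    import numpy as np

    psi = np.array([[2.0, 0.5],
                    [0.5, 2.0]])  # f(y1, y2) = psi[y1, y2]; g(y2, y3) = psi[y2, y3]

    ones = np.ones(2)  # leaf variables send the empty-product message r = 1

    # factor-to-variable messages into y2: marginalize out the other argument
    s_f_to_y2 = psi.T @ ones  # s_{f->y2}(y2) = sum_{y1} f(y1, y2) * 1
    s_g_to_y2 = psi @ ones    # s_{g->y2}(y2) = sum_{y3} g(y2, y3) * 1

    # marginal at y2: product of incoming factor messages, normalized
    p_y2 = s_f_to_y2 * s_g_to_y2
    p_y2 /= p_y2.sum()
    print(p_y2)  # matches Z_2(w) / Z from the brute-force sketch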

SLIDE 52

Example

[Figure: worked example — a factor graph over variables x1, x2, x3, x4 with factor nodes f, g, h]

SLIDE 53

Max-Product (Max-Sum)

Problem: how do we find the most likely (best) setting of the latent variables?
Solution: replace sum (+) with max in the factor→variable computations.

$$s_{m \to n}(y_n) = \max_{\boldsymbol{x}_m \setminus n} g_m(\boldsymbol{x}_m) \prod_{n' \in N(m) \setminus n} r_{n' \to m}(y_{n'})$$

(Why “max-sum”? Computationally, implement with logs: the log turns products into sums, and long products of probabilities would otherwise underflow.)
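The same chain sketch with max in place of sum (in practice done in log space, turning the products into sums):

    import numpy as np

    psi = np.array([[2.0, 0.5],
                    [0.5, 2.0]])

    # max-product messages into y2: maximize out the other argument
    m_f_to_y2 = psi.max(axis=0)  # max_{y1} f(y1, y2)
    m_g_to_y2 = psi.max(axis=1)  # max_{y3} g(y2, y3)

    # y2's value on the single highest-scoring configuration
    print(int((m_f_to_y2 * m_g_to_y2).argmax()))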

SLIDE 54

Loopy Belief Propagation

  • The sum-product algorithm is not exact for general graphs
  • Loopy Belief Propagation (Loopy BP): run the sum-product algorithm anyway and hope for the best
  • Requires a message-passing schedule
  • Next classes: approximate algorithms