

SLIDE 1

Directed Probabilistic Graphical Models

CMSC 678 UMBC

SLIDE 2

Announcement 1: Assignment 3

Due Wednesday, April 11th, 11:59 AM. Any questions?

SLIDE 3

Announcement 2: Progress Report on Project

Due Monday, April 16th, 11:59 AM. Build on the proposal:
  • Update to address comments
  • Discuss the progress you’ve made
  • Discuss what remains to be done
  • Discuss any new blocks you’ve experienced (or anticipate experiencing)
Any questions?

SLIDE 4

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 5

Recap from last time…

SLIDE 6

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:

  0. Assume some value for your parameters
  1. E-step: count under uncertainty, assuming these parameters
  2. M-step: maximize log-likelihood, assuming these uncertain counts

[diagram: the E-step turns the current parameters p^(t)(z) into estimated counts count(z_i, w_i) weighted by p(z_i); the M-step turns those counts into new parameters p^(t+1)(z)]

SLIDE 7

EM Math

M-step: maximize the expected complete-data log-likelihood under the old parameters:

max_θ E_{z ∼ p_θ^(t)(⋅|w)} [ log p_θ(z, w) ]

E-step: count under uncertainty (the posterior distribution under the old parameters)
M-step: maximize log-likelihood (producing the new parameters)

𝒞(θ) = log-likelihood of complete data (X, Y)
𝒫(θ) = posterior log-likelihood of incomplete data Y
ℒ(θ) = marginal log-likelihood of observed data X

ℒ(θ) = E_{Y∼θ^(t)}[𝒞(θ) | X] − E_{Y∼θ^(t)}[𝒫(θ) | X]

EM does not decrease the marginal log-likelihood

SLIDE 8

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 9

Assume an original optimization problem

Lagrange multipliers

SLIDE 10

Assume an original optimization problem. We convert it to a new optimization problem:

Lagrange multipliers
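The equations for these Lagrange-multiplier slides did not survive extraction. As a reminder, a sketch of the standard construction (my notation, not necessarily the slides’):

```latex
% original problem: maximize an objective subject to an equality constraint
\max_{\theta} f(\theta) \quad \text{s.t.} \quad g(\theta) = 0
% converted problem: introduce a multiplier \lambda and work with the Lagrangian
\mathcal{F}(\theta, \lambda) = f(\theta) - \lambda\, g(\theta), \qquad
\frac{\partial \mathcal{F}}{\partial \theta} = 0, \quad
\frac{\partial \mathcal{F}}{\partial \lambda} = -g(\theta) = 0
```

Setting both partial derivatives to zero recovers a stationary point of f that also satisfies the constraint.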

SLIDE 11

Lagrange multipliers: an equivalent problem?

SLIDE 12

Lagrange multipliers: an equivalent problem?

SLIDE 13

Lagrange multipliers: an equivalent problem?

SLIDE 14

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 15

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

N different (independent) rolls: w_1 = 1, w_2 = 5, w_3 = 4, ⋯

Generative Story:
for roll i = 1 to N:
    w_i ∼ Cat(θ)

θ is a probability distribution over the 6 sides of the die:
Σ_{k=1}^6 θ_k = 1,   0 ≤ θ_k ≤ 1 ∀k

SLIDE 16

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

N different (independent) rolls: w_1 = 1, w_2 = 5, w_3 = 4, ⋯

Generative Story:
for roll i = 1 to N:
    w_i ∼ Cat(θ)

Maximize Log-likelihood:
ℒ(θ) = Σ_i log p_θ(w_i) = Σ_i log θ_{w_i}

SLIDE 17

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

N different (independent) rolls

Generative Story:
for roll i = 1 to N:
    w_i ∼ Cat(θ)

Maximize Log-likelihood:
ℒ(θ) = Σ_i log θ_{w_i}

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

SLIDE 18

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

N different (independent) rolls

Generative Story:
for roll i = 1 to N:
    w_i ∼ Cat(θ)

Maximize Log-likelihood:
ℒ(θ) = Σ_i log θ_{w_i}

Q: What’s an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing θ_k (we know θ must be a distribution, but that’s not specified in the objective)

SLIDE 19

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):
ℒ(θ) = Σ_i log θ_{w_i}   s.t.   Σ_{k=1}^6 θ_k = 1

(we could include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed)

solve using Lagrange multipliers

SLIDE 20

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):
ℱ(θ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^6 θ_k − 1)

(we could include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed)

∂ℱ(θ)/∂θ_k = Σ_{i: w_i = k} 1/θ_{w_i} − λ
∂ℱ(θ)/∂λ = −Σ_{k=1}^6 θ_k + 1

SLIDE 21

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):
ℱ(θ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^6 θ_k − 1)

(we could include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed)

θ_k = (Σ_{i: w_i = k} 1) / λ

optimal λ when Σ_{k=1}^6 θ_k = 1
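To fill in the step to the next slide: substituting θ_k = N_k/λ (where N_k = Σ_{i: w_i=k} 1 is the number of rolls showing side k) into the constraint pins down λ:

```latex
\sum_{k=1}^{6} \theta_k = \sum_{k=1}^{6} \frac{N_k}{\lambda} = 1
\;\Longrightarrow\;
\lambda = \sum_{k=1}^{6} N_k = N
```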

SLIDE 22

Probabilistic Estimation of Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

N different (independent) rolls

Maximize Log-likelihood (with distribution constraints):
ℱ(θ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^6 θ_k − 1)

(we could include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed)

θ_k = (Σ_{i: w_i = k} 1) / (Σ_{k′} Σ_{i: w_i = k′} 1) = N_k / N

optimal λ when Σ_{k=1}^6 θ_k = 1
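A minimal sketch of this closed-form MLE in code (function and variable names are mine, not the slides’):

```python
from collections import Counter

def die_mle(rolls, num_sides=6):
    """Closed-form MLE for a categorical: theta_k = N_k / N."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts[k] / n for k in range(1, num_sides + 1)]

rolls = [1, 5, 4, 1, 3, 6, 1, 2]
theta = die_mle(rolls)  # theta[0] == 3/8: side 1 came up on 3 of the 8 rolls
```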

SLIDE 23

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1|z_1) ⋯ p(z_N) p(w_N|z_N) = ∏_i p(w_i|z_i) p(z_i)

add complexity to better explain what we see

w_1 = 1, w_2 = 5, ⋯ ;  z_1 = H, z_2 = T

p(heads) = λ, p(tails) = 1 − λ (penny)
p(heads) = γ, p(tails) = 1 − γ (dollar coin)
p(heads) = ψ, p(tails) = 1 − ψ (dime)

SLIDE 24

Example: Conditionally Rolling a Die

p(w_1, w_2, …, w_N) = ∏_i p(w_i)

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = ∏_i p(w_i|z_i) p(z_i)

add complexity to better explain what we see

λ = distribution over the penny
γ = distribution for the dollar coin
ψ = distribution over the dime

Generative Story:
for item i = 1 to N:
    z_i ~ Bernoulli(λ)
    if z_i = H: w_i ~ Bernoulli(γ)
    else: w_i ~ Bernoulli(ψ)
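A sketch of sampling from this generative story (the bias values here are illustrative assumptions, not from the slides):

```python
import random

def sample_coins(n, lam=0.5, gamma=0.7, psi=0.3):
    """For each item: flip the penny (bias lam) to choose a coin,
    then flip the dollar coin (bias gamma) or the dime (bias psi)."""
    pairs = []
    for _ in range(n):
        z = 'H' if random.random() < lam else 'T'   # z_i ~ Bernoulli(lam)
        bias = gamma if z == 'H' else psi
        w = 'H' if random.random() < bias else 'T'  # w_i ~ Bernoulli(coin's bias)
        pairs.append((z, w))
    return pairs
```

Only the w values would be visible to a learner; the z column is the latent variable EM has to reason about.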

SLIDE 25

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 26

Classify with Bayes Rule:

argmax_Y p(Y | X) = argmax_Y log p(X | Y) + log p(Y)
                              (likelihood)   (prior)

SLIDE 27

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

SLIDE 28

The Bag of Words Representation

Adapted from Jurafsky & Martin (draft)

SLIDE 29

The Bag of Words Representation


Adapted from Jurafsky & Martin (draft)

SLIDE 30

Bag of Words Representation

[figure: a review document is reduced to a bag of word counts (seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …), which the classifier maps to a class: γ(d) = c]

Adapted from Jurafsky & Martin (draft)

SLIDE 31

Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels   (global parameters)
for label k = 1 to K:
    θ_k = generate parameters

SLIDE 32

Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels
for label k = 1 to K:
    θ_k = generate parameters
for item i = 1 to N:
    y_i ~ Cat(π)

SLIDE 33

Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels
for label k = 1 to K:
    θ_k = generate parameters
for item i = 1 to N:
    y_i ~ Cat(π)
    for each feature j:
        x_ij ∼ F_j(θ_{y_i})   (local variables: x_i1, x_i2, x_i3, x_i4, x_i5)

SLIDE 34

Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels
for label k = 1 to K:
    θ_k = generate parameters
for item i = 1 to N:
    y_i ~ Cat(π)
    for each feature j:
        x_ij ∼ F_j(θ_{y_i})

each x_ij is conditionally independent of the others (given the label)

SLIDE 35

Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels
for label k = 1 to K:
    θ_k = generate parameters
for item i = 1 to N:
    y_i ~ Cat(π)
    for each feature j:
        x_ij ∼ F_j(θ_{y_i})

Maximize Log-likelihood:
ℒ(θ) = Σ_i Σ_j log F_j(x_ij; θ_{y_i}) + Σ_i log π_{y_i}
s.t.  Σ_k π_k = 1,   each θ_k is valid for F_j

SLIDE 36

Multinomial Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels
for label k = 1 to K:
    θ_k = distribution over J feature values
for item i = 1 to N:
    y_i ~ Cat(π)
    for each feature j:
        x_ij ∼ Cat(θ_{y_i})

Maximize Log-likelihood:
ℒ(θ) = Σ_i Σ_j log θ_{y_i, x_ij} + Σ_i log π_{y_i}
s.t.  Σ_k π_k = 1,   Σ_j θ_kj = 1 ∀k

SLIDE 37

Multinomial Naïve Bayes: A Generative Story

Generative Story:

π = distribution over K labels
for label k = 1 to K:
    θ_k = distribution over J feature values
for item i = 1 to N:
    y_i ~ Cat(π)
    for each feature j:
        x_ij ∼ Cat(θ_{y_i})

Maximize Log-likelihood via Lagrange Multipliers:
ℱ(θ) = Σ_i Σ_j log θ_{y_i, x_ij} + Σ_i log π_{y_i} − μ (Σ_k π_k − 1) − Σ_k λ_k (Σ_j θ_kj − 1)

SLIDE 38

Multinomial Naïve Bayes: Learning

Calculate class priors:
for each class k:
    items_k = all items with class = k
    p(k) = |items_k| / (# items)

Calculate feature generation terms:
for each class k:
    obs_k = single object containing all items labeled as k
    for each feature j:
        n_kj = # of occurrences of j in obs_k
        p(j | k) = n_kj / Σ_{j′} n_kj′
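A runnable sketch of this training procedure for the multinomial case (names are mine; no smoothing, to match the slide):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(items):
    """items: list of (label, token_list) pairs.
    Returns class priors p(k) and per-class token probabilities p(j | k)."""
    class_counts = Counter(label for label, _ in items)
    token_counts = defaultdict(Counter)              # n_kj
    for label, tokens in items:
        token_counts[label].update(tokens)
    n = len(items)
    p_class = {k: c / n for k, c in class_counts.items()}
    p_token = {k: {j: c / sum(cnts.values()) for j, c in cnts.items()}
               for k, cnts in token_counts.items()}
    return p_class, p_token
```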

SLIDE 39

Brill and Banko (2001) With enough data, the classifier may not matter

Adapted from Jurafsky & Martin (draft)

SLIDE 40

Summary: Naïve Bayes is Not So Naïve, but not without issue

Pro:
  • Very fast, low storage requirements
  • Robust to irrelevant features
  • Very good in domains with many equally important features
  • Optimal if the independence assumptions hold
  • Dependable baseline for text classification (but often not the best)

Con:
  • Models the posterior in one go? (e.g., use conditional maxent instead)
  • Are the features really uncorrelated?
  • Are plain counts always appropriate?
  • Are there “better” (automated, more principled) ways of handling missing/noisy data?

Adapted from Jurafsky & Martin (draft)

SLIDE 41

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 42

Hidden Markov Models

p(British Left Waffles on Falkland Islands)

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model: p(w_i | z_i)
Bigram model of the classes: p(z_i | z_{i−1})

Model all class sequences z_1, …, z_N:
p(z_1, w_1, z_2, w_2, …, z_N, w_N)

SLIDE 43

Hidden Markov Model

Goal: maximize the (log-)likelihood. In practice we don’t actually observe these z values; we just see the words w.

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

SLIDE 44

Hidden Markov Model

Goal: maximize the (log-)likelihood. In practice we don’t actually observe these z values; we just see the words w.

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

if we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :(
if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

SLIDE 45

Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

SLIDE 46

Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

transition probabilities/parameters: p(z_i | z_{i−1})

SLIDE 47

Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

emission probabilities/parameters: p(w_i | z_i)
transition probabilities/parameters: p(z_i | z_{i−1})

SLIDE 48

Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

emission probabilities/parameters: p(w_i | z_i)
transition probabilities/parameters: p(z_i | z_{i−1})

SLIDE 49

Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

emission probabilities/parameters: p(w_i | z_i)
transition probabilities/parameters: p(z_i | z_{i−1})

Q: How many different probability values are there with K states and V vocab items?

SLIDE 50

Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

emission probabilities/parameters: p(w_i | z_i)
transition probabilities/parameters: p(z_i | z_{i−1})

Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values

SLIDE 51

Hidden Markov Model Representation

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

emission probabilities/parameters: p(w_i | z_i)
transition probabilities/parameters: p(z_i | z_{i−1})

represent the probabilities and independence assumptions in a graph:

[graph: a chain z_1 → z_2 → z_3 → z_4, with each z_i also pointing to its word w_i]

SLIDE 52

Hidden Markov Model Representation

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})

emission probabilities/parameters: p(w_i | z_i)
transition probabilities/parameters: p(z_i | z_{i−1})

[graph: chain z_1 → z_2 → z_3 → z_4 with emissions w_1 … w_4; the z_i → w_i arcs are labeled p(w_1|z_1) … p(w_4|z_4), the z_{i−1} → z_i arcs are labeled p(z_2|z_1), p(z_3|z_2), p(z_4|z_3), and p(z_1|z_0) is the initial starting distribution (“start”)]

Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.

SLIDE 53

Example: 2-state Hidden Markov Model as a Lattice

[lattice: states z_i ∈ {N, V} at steps 1–4, emitting w_1 … w_4]

Transition probabilities p(to | from):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(w | state):
      w1    w2    w3    w4
N     .7    .2    .05   .05
V     .2    .6    .1    .1

SLIDE 54

Example: 2-state Hidden Markov Model as a Lattice

[lattice: as on slide 53, with the emission arcs labeled p(w_1|N) … p(w_4|N) and p(w_1|V) … p(w_4|V)]

(transition and emission tables as on slide 53)

SLIDE 55

Example: 2-state Hidden Markov Model as a Lattice

[lattice: as on slide 54, adding the same-state transition arcs p(N|start), p(N|N) ×3 and p(V|start), p(V|V) ×3]

(transition and emission tables as on slide 53)

SLIDE 56

Example: 2-state Hidden Markov Model as a Lattice

[lattice: as on slide 55, adding the cross-state transition arcs p(V|N) ×3 and p(N|V) ×3]

(transition and emission tables as on slide 53)

SLIDE 57

A Latent Sequence is a Path through the Graph

[lattice: the path (N, w_1) → (V, w_2) → (V, w_3) → (N, w_4) highlighted, with transition arcs p(N|start), p(V|N), p(V|V), p(N|V) and emission arcs p(w_1|N), p(w_2|V), p(w_3|V), p(w_4|N)]

Q: What’s the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 0.000247

(transition and emission tables as on slide 53)
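Checking the arithmetic against the tables:

```python
# p(N|start)p(w1|N) * p(V|N)p(w2|V) * p(V|V)p(w3|V) * p(N|V)p(w4|N)
p = (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05)
print(p)  # ~0.000247
```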

SLIDE 58

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 59

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side.

ITILA, Ch 16

SLIDE 60

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

SLIDE 61

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

SLIDE 62

Message Passing: Count the Soldiers

If you are the front soldier in the line, say the number ‘one’ to the soldier behind you. If you are the rearmost soldier in the line, say the number ‘one’ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side

ITILA, Ch 16

SLIDE 63

What’s the Maximum Weighted Path?

[figure: a small graph whose nodes/edges carry the weights 9, 6, 7, 3, 32, 1, 4]

SLIDE 64

What’s the Maximum Weighted Path?

[figure: the same weighted graph]

SLIDE 65

What’s the Maximum Weighted Path?

[figure: the same weighted graph; message passing has propagated the partial sums +3 → 10 and +3 → 7]

SLIDE 66

What’s the Maximum Weighted Path?

[figure: the same weighted graph with partial sums +3 → 10 and +3 → 7]

SLIDE 67

What’s the Maximum Weighted Path?

[figure: the same weighted graph; further partial sums +10 → 19, +10 → 16, +10 → 42, +10 → 11]

SLIDE 68

What’s the Maximum Weighted Path?

[figure: the same weighted graph with all partial sums propagated: +3 → 10, +3 → 7, +10 → 19, +10 → 16, +10 → 42, +10 → 11]

SLIDE 69

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 70

What’s the Maximum Value?

consider any path ending in B (…A→B, …B→B, or …C→B): maximize across the previous hidden state values

v(i, B) = max_{s'} v(i−1, s') * p(B | s') * p(obs at i | B)

v(i, B) is the maximum probability of any path to state B from the beginning (that also emits the observation at each step)

[lattice: columns of states A, B, C at steps i−1 and i]

SLIDE 71

What’s the Maximum Value?

consider any path ending in B (…A→B, …B→B, or …C→B): maximize across the previous hidden state values

[lattice: columns of states A, B, C at steps i−2, i−1, and i]

v(i, B) = max_{s'} v(i−1, s') * p(B | s') * p(obs at i | B)

v(i, B) is the maximum probability of any path to state B from the beginning (that also emits the observation at each step)
SLIDE 72

What’s the Maximum Value?

consider any path ending in B (…A→B, …B→B, or …C→B): maximize across the previous hidden state values

[lattice: columns of states A, B, C at steps i−2, i−1, and i]

computing v at time i−1 will correctly incorporate (maximize over) paths through time i−2: we correctly obey the Markov property

v(i, B) = max_{s'} v(i−1, s') * p(B | s') * p(obs at i | B)

v(i, B) is the maximum probability of any path to state B from the beginning (that also emits the observation at each step)

→ the Viterbi algorithm

SLIDE 73

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]        // backpointers / book-keeping
v[*][*] = 0
v[0][START] = 1
for(i = 1; i ≤ N+1; ++i) {
    for(state = 0; state < K*; ++state) {
        pobs = p_emission(obs_i | state)
        for(old = 0; old < K*; ++old) {
            pmove = p_transition(state | old)
            if(v[i-1][old] * pobs * pmove > v[i][state]) {
                v[i][state] = v[i-1][old] * pobs * pmove
                b[i][state] = old
            }
        }
    }
}
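The same algorithm as a runnable Python sketch. The dict-based layout (p_trans[s_prev][s] = p(s | s_prev), p_emit[s][w] = p(w | s)) is my assumption, and the final END transition is omitted for brevity:

```python
def viterbi(obs, states, p_trans, p_emit):
    """Most probable state sequence for obs, via max-product message passing."""
    v = [{'START': 1.0}]        # v[0][START] = 1
    back = [{}]                 # backpointers / book-keeping
    for i, w in enumerate(obs, start=1):
        v.append({})
        back.append({})
        for s in states:
            # maximize v[i-1][old] * p(s | old) * p(w | s) over previous states
            old, score = max(
                ((o, v[i - 1][o] * p_trans[o].get(s, 0.0) * p_emit[s].get(w, 0.0))
                 for o in v[i - 1]),
                key=lambda t: t[1])
            v[i][s], back[i][s] = score, old
    s = max(v[-1], key=v[-1].get)       # best final state
    path = [s]
    for i in range(len(obs), 1, -1):    # follow backpointers to the start
        s = back[i][s]
        path.append(s)
    return path[::-1]

states = ['N', 'V']
p_trans = {'START': {'N': .7, 'V': .2}, 'N': {'N': .15, 'V': .8}, 'V': {'N': .6, 'V': .35}}
p_emit = {'N': {'w1': .7, 'w2': .2, 'w3': .05, 'w4': .05},
          'V': {'w1': .2, 'w2': .6, 'w3': .1, 'w4': .1}}
print(viterbi(['w1', 'w2', 'w3', 'w4'], states, p_trans, p_emit))
```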

SLIDE 74

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 75

Forward Probability

α(i, B) is the total probability of all paths to state B from the beginning

[lattice: columns of states A, B, C at steps i−2, i−1, and i]

SLIDE 76

Forward Probability

marginalize across the previous hidden state values:

α(i, B) = Σ_{s'} α(i−1, s') * p(B | s') * p(obs at i | B)

computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property

α(i, B) is the total probability of all paths to state B from the beginning

[lattice: columns of states A, B, C at steps i−2, i−1, and i]

SLIDE 77

Forward Probability

α(i, s) is the total probability of all paths:

  1. that start from the beginning
  2. that end (currently) in s at step i
  3. that emit the observation obs at i

α(i, s) = Σ_{s'} α(i−1, s') * p(s | s') * p(obs at i | s)

SLIDE 78

Forward Probability

α(i, s) is the total probability of all paths:

  1. that start from the beginning
  2. that end (currently) in s at step i
  3. that emit the observation obs at i

α(i, s) = Σ_{s'} α(i−1, s') * p(s | s') * p(obs at i | s)

reading the three factors: what’s the total probability up until now (α(i−1, s'))? what are the immediate ways to get into state s (p(s | s'))? how likely is it to get into state s this way, emitting the observation (p(obs at i | s))?

SLIDE 79

Forward Probability

α(i, s) is the total probability of all paths:

  1. that start from the beginning
  2. that end (currently) in s at step i
  3. that emit the observation obs at i

α(i, s) = Σ_{s'} α(i−1, s') * p(s | s') * p(obs at i | s)

reading the three factors: what’s the total probability up until now (α(i−1, s'))? what are the immediate ways to get into state s (p(s | s'))? how likely is it to get into state s this way, emitting the observation (p(obs at i | s))?

Q: What do we return? (How do we return the likelihood of the sequence?)

A: α[N+1][end]
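A matching Python sketch of the forward pass (same assumed dict layout as the Viterbi sketch; here the END transition is included so that alpha[N+1]['END'] is the sequence likelihood):

```python
def forward(obs, states, p_trans, p_emit):
    """alpha[i][s] = total probability of reaching state s at step i
    while emitting obs[0..i-1]."""
    alpha = [{'START': 1.0}]
    for i, w in enumerate(obs, start=1):
        alpha.append({})
        for s in states:
            # sum (not max, as in Viterbi) over the previous states
            alpha[i][s] = p_emit[s].get(w, 0.0) * sum(
                alpha[i - 1][o] * p_trans[o].get(s, 0.0) for o in alpha[i - 1])
    # absorb into END: alpha[N+1][END] is the likelihood of the whole sequence
    alpha.append({'END': sum(alpha[-1][s] * p_trans[s].get('END', 0.0)
                             for s in alpha[-1])})
    return alpha
```

Swapping the sum for a max (and keeping backpointers) turns this back into Viterbi; that is the only difference between the two passes.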

SLIDE 80

Outline

Recap of EM
Math: Lagrange Multipliers for constrained optimization
Probabilistic Modeling Example: Die Rolling
Directed Graphical Models
  • Naïve Bayes
  • Hidden Markov Models
Message Passing: Directed Graphical Model Inference
  • Most likely sequence
  • Total (marginal) probability
EM in D-PGMs

SLIDE 81

Forward & Backward Message Passing

[lattice: columns of states A, B, C at steps i−1, i, and i+1]

α(i, s) is the total probability of all paths:
  1. that start from the beginning
  2. that end (currently) in s at step i
  3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
  1. that start at step i at state s
  2. that terminate at the end
  3. (that emit the observation obs at i+1)

SLIDE 82

Forward & Backward Message Passing

[lattice: columns of states A, B, C at steps i−1, i, and i+1, with α(i, B) accumulating from the left and β(i, B) accumulating from the right]

α(i, s) is the total probability of all paths:
  1. that start from the beginning
  2. that end (currently) in s at step i
  3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
  1. that start at step i at state s
  2. that terminate at the end
  3. (that emit the observation obs at i+1)

SLIDE 83

Forward & Backward Message Passing

[lattice: columns of states A, B, C at steps i−1, i, and i+1]

α(i, s): total probability of all paths that start from the beginning, end in s at step i, and emit the observation at i
β(i, s): total probability of all paths that start at step i in state s, terminate at the end, (and emit the observation at i+1)

α(i, B) * β(i, B) = total probability of paths through state B at step i

SLIDE 84

Forward & Backward Message Passing

[lattice: columns of states A, B, C at steps i−1, i, and i+1, highlighting α(i, B) and β(i+1, s')]

α(i, s): total probability of all paths that start from the beginning, end in s at step i, and emit the observation at i
β(i, s): total probability of all paths that start at step i in state s, terminate at the end, (and emit the observation at i+1)

SLIDE 85

Forward & Backward Message Passing

[lattice: columns of states A, B, C at steps i−1, i, and i+1, highlighting α(i, B) and β(i+1, s')]

α(i, s): total probability of all paths that start from the beginning, end in s at step i, and emit the observation at i
β(i, s): total probability of all paths that start at step i in state s, terminate at the end, (and emit the observation at i+1)

α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B → s' arc (at time i)

SLIDE 86

With Both Forward and Backward Values

α(i, s) * β(i, s) = total probability of paths through state s at step i
α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s → s' arc (at time i)

p(z_i = s | w_1, ⋯, w_N) = α(i, s) * β(i, s) / α(N+1, END)

p(z_i = s, z_{i+1} = s' | w_1, ⋯, w_N) = α(i, s) * p(s' | s) * p(obs_{i+1} | s') * β(i+1, s') / α(N+1, END)
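Given forward values α and backward values β in the same layout, the two posteriors on this slide are direct quotients — a sketch, assuming alpha/beta are lists of dicts as in the earlier snippets:

```python
def node_posterior(alpha, beta, i, s):
    """p(z_i = s | w_1..w_N) = alpha(i, s) * beta(i, s) / alpha(N+1, END)."""
    return alpha[i][s] * beta[i][s] / alpha[-1]['END']

def arc_posterior(alpha, beta, p_trans, p_emit, obs, i, s, s2):
    """p(z_i = s, z_{i+1} = s2 | w_1..w_N); obs[i] is w_{i+1} (0-indexed obs)."""
    return (alpha[i][s] * p_trans[s][s2] * p_emit[s2][obs[i]]
            * beta[i + 1][s2] / alpha[-1]['END'])
```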

SLIDE 87

Expectation Maximization (EM)

Two-step, iterative algorithm:

  0. Assume some value for your parameters
  1. E-step: count under uncertainty, assuming these parameters
  2. M-step: maximize log-likelihood, assuming these uncertain counts (estimated counts)

SLIDE 88

Expectation Maximization (EM)

Two-step, iterative algorithm:

  0. Assume some value for your parameters
  1. E-step: count under uncertainty, assuming these parameters
  2. M-step: maximize log-likelihood, assuming these uncertain counts (estimated counts)

for an HMM, the parameters are p_obs(w | s) and p_trans(s' | s)

SLIDE 89

Expectation Maximization (EM)

Two-step, iterative algorithm:

  0. Assume some value for your parameters
  1. E-step: count under uncertainty, assuming these parameters
  2. M-step: maximize log-likelihood, assuming these uncertain counts (estimated counts)

for an HMM, the parameters are p_obs(w | s) and p_trans(s' | s)

p*(z_i = s | w_1, ⋯, w_N) = α(i, s) * β(i, s) / α(N+1, END)
p*(z_i = s, z_{i+1} = s' | w_1, ⋯, w_N) = α(i, s) * p(s' | s) * p(obs_{i+1} | s') * β(i+1, s') / α(N+1, END)

SLIDE 90

M-Step

“maximize log-likelihood, assuming these uncertain counts”

if we observed the hidden transitions, we would just normalize their counts:

p_new(s' | s) = c(s → s') / Σ_{s''} c(s → s'')

SLIDE 91

M-Step

“maximize log-likelihood, assuming these uncertain counts”

we don’t observe the hidden transitions, but we can approximately count them:

p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]

SLIDE 92

M-Step

“maximize log-likelihood, assuming these uncertain counts”

we don’t observe the hidden transitions, but we can approximately count them:

p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]

we compute these expected counts in the E-step, with our α and β values
SLIDE 93

EM For HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for(i = N; i ≥ 0; --i) {
    for(next = 0; next < K*; ++next) {
        c_obs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
        for(state = 0; state < K*; ++state) {
            u = p_obs(obs_{i+1} | next) * p_trans(next | state)
            c_trans(next | state) += α[i][state] * u * β[i+1][next] / L
        }
    }
}
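A Python rendering of the same E-step accumulation (reusing the assumed dict layout; the start and end boundary transitions are omitted for brevity):

```python
from collections import Counter, defaultdict

def expected_counts(obs, states, p_trans, p_emit, alpha, beta):
    """Accumulate expected emission and transition counts for one sequence."""
    L = alpha[-1]['END']                       # sequence likelihood
    c_obs = defaultdict(Counter)               # c_obs[s][w]
    c_trans = defaultdict(Counter)             # c_trans[s][s2]
    for i in range(1, len(obs) + 1):
        w = obs[i - 1]
        for s in states:
            c_obs[s][w] += alpha[i][s] * beta[i][s] / L
            if i < len(obs):
                for s2 in states:
                    u = p_emit[s2].get(obs[i], 0.0) * p_trans[s].get(s2, 0.0)
                    c_trans[s][s2] += alpha[i][s] * u * beta[i + 1][s2] / L
    return c_obs, c_trans
```

The M-step then just normalizes: p_new(s2 | s) = c_trans[s][s2] / sum(c_trans[s].values()), exactly the formula from slide 91.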

SLIDE 94

Bayesian Networks: Directed Acyclic Graphs

p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i))

π(x_i) = the “parents of” x_i; the product follows a topological sort of the DAG
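A sketch of evaluating this factorization (the table layout is my assumption). Note that evaluating a full joint doesn’t itself need the topological order — the product is the same in any order — but sampling from the network does:

```python
def joint_probability(assignment, parents, cpt):
    """p(x_1..x_N) = prod_i p(x_i | parents(x_i)).
    parents[v]: tuple of parent names; cpt[v]: (value, parent_values) -> prob."""
    p = 1.0
    for v, value in assignment.items():
        parent_vals = tuple(assignment[u] for u in parents[v])
        p *= cpt[v][(value, parent_vals)]
    return p
```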

SLIDE 95

Bayesian Networks: Directed Acyclic Graphs

p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i))

exact inference in general DAGs is NP-hard; inference in trees can be exact

SLIDE 96

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

X & Y are d-separated if, for all paths P, one of the following is true:
  • P has a chain with an observed middle node: X → Z → Y
  • P has a fork with an observed parent node: X ← Z → Y
  • P includes a “v-structure” or “collider” with all unobserved descendants: X → Z ← Y

SLIDE 97

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

X & Y are d-separated if, for all paths P, one of the following is true:
  • P has a chain with an observed middle node: X → Z → Y (observing Z blocks the path from X to Y)
  • P has a fork with an observed parent node: X ← Z → Y (observing Z blocks the path from X to Y)
  • P includes a “v-structure” or “collider” with all unobserved descendants: X → Z ← Y (not observing Z blocks the path from X to Y)

SLIDE 98

D-Separation: Testing for Conditional Independence

Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.

X & Y are d-separated if, for all paths P, one of the following is true:
  • P has a chain with an observed middle node: X → Z → Y (observing Z blocks the path from X to Y)
  • P has a fork with an observed parent node: X ← Z → Y (observing Z blocks the path from X to Y)
  • P includes a “v-structure” or “collider” with all unobserved descendants: X → Z ← Y (not observing Z blocks the path from X to Y)

collider example: p(x, y, z) = p(x) p(y) p(z | x, y)
p(x, y) = Σ_z p(x) p(y) p(z | x, y) = p(x) p(y)

SLIDE 99

Markov Blanket

The Markov blanket of a node x is its parents, its children, and its children’s parents: the set of nodes needed to form the complete conditional for a variable x_i.

p(x_i | x_{j≠i}) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_i

  = ∏_k p(x_k | π(x_k)) / ∫ ∏_k p(x_k | π(x_k)) dx_i      (factorization of the graph)

  = ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) / ∫ ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) dx_i      (factor out terms not dependent on x_i)