slide-1
SLIDE 1

Belief Propagation

10-418 / 10-618 Machine Learning for Structured Data
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Matt Gormley
Lecture 9
Sep. 25, 2019

slide-2
SLIDE 2

Q&A

Q: What if I already answered a homework question using different assumptions than what was clarified in a Piazza note?

A: Just write down the assumptions you made. We will usually give credit so long as your assumptions are clear in the writeup and your answer is correct under those assumptions. (Obviously, this only applies to underspecified / ambiguous questions. You can't just add arbitrary assumptions!)
slide-3
SLIDE 3

Reminders

  • Homework 1: DAgger for seq2seq
    – Out: Thu, Sep. 12
    – Due: Thu, Sep. 26 at 11:59pm
  • Homework 2: Labeling Syntax Trees
    – Out: Thu, Sep. 26
    – Due: Thu, Oct. 10 at 11:59pm

slide-4
SLIDE 4

Variable Elimination Complexity

In-Class Exercise: Fill in the blanks.

Brute-force (naïve) inference is O(____). Variable elimination is O(____),

where:
  • n = # of variables
  • k = max # of values a variable can take
  • r = # of variables participating in the largest "intermediate" table

Instead of brute force, capitalize on the factorization of p(x).

[Figure: factor graph over X1–X5 with factors ψ12, ψ13, ψ23, ψ234, ψ45, ψ5]
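To make the gap concrete, here is a minimal sketch (not from the slides) comparing the two approaches on a small chain model with made-up pairwise potentials: brute force touches all k^n joint assignments, while variable elimination does n sums over tables of size k^2.

```python
import itertools
import numpy as np

# Unnormalized chain model p(x) ∝ ∏ psi[i](x_i, x_{i+1}),
# n variables, each taking k values (potentials are made up).
rng = np.random.default_rng(0)
n, k = 6, 3
psi = [rng.uniform(0.5, 2.0, size=(k, k)) for _ in range(n - 1)]

# Brute force: O(k^n) terms -- enumerate every joint assignment.
Z_brute = 0.0
for x in itertools.product(range(k), repeat=n):
    p = 1.0
    for i in range(n - 1):
        p *= psi[i][x[i], x[i + 1]]
    Z_brute += p

# Variable elimination: O(n * k^2) -- sum out one variable at a time.
# tau[x_{i+1}] = sum_{x_i} tau[x_i] * psi_i(x_i, x_{i+1})
tau = np.ones(k)
for i in range(n - 1):
    tau = tau @ psi[i]      # k-vector times k x k table
Z_ve = tau.sum()

print(Z_brute, Z_ve)        # identical up to floating point
```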

slide-5
SLIDE 5

Exact Inference

Variable Elimination

  • Uses
    – Computes the partition function of any factor graph
    – Computes the marginal probability of a query variable in any factor graph
  • Limitations
    – Only computes the marginal for one variable at a time (i.e. you need to re-run variable elimination for each variable if you need them all)
    – Elimination order affects runtime

Belief Propagation

  • Uses
    – Computes the partition function of any acyclic factor graph
    – Computes all marginal probabilities of factors and variables at once, for any acyclic factor graph
  • Limitations
    – Only exact on acyclic factor graphs (though we'll consider its "loopy" variant later)
    – Message passing order affects runtime (but the obvious topological ordering always works best)

slide-6
SLIDE 6

MESSAGE PASSING


slide-7
SLIDE 7

Great Ideas in ML: Message Passing

Count the soldiers

[Figure: six soldiers in single file; each passes "N behind you" messages up the line ("1 behind you" … "5 behind you") and "N before you" messages down the line ("1 before you" … "5 before you"), plus "there's 1 of me"]

adapted from MacKay (2003) textbook

slide-8
SLIDE 8

Great Ideas in ML: Message Passing

Count the soldiers

[Figure: one soldier hears "3 behind you" and "2 before you", and knows "there's 1 of me"]

"I only see my incoming messages (2, 1, 3)."

Belief: Must be 2 + 1 + 3 = 6 of us.

adapted from MacKay (2003) textbook

slide-9
SLIDE 9

Great Ideas in ML: Message Passing

Count the soldiers

[Figure: the next soldier hears "4 behind you" and "1 before you", and knows "there's 1 of me"]

"I only see my incoming messages (1, 1, 4)."

Belief: Must be 1 + 1 + 4 = 6 of us.

adapted from MacKay (2003) textbook

slide-10
SLIDE 10

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree.

[Figure: tree of soldiers; reports "7 here" and "3 here" combine with "1 of me" into "11 here (= 7 + 3 + 1)"]

adapted from MacKay (2003) textbook

slide-11
SLIDE 11

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree.

[Figure: a soldier combines "3 here" and "3 here" with himself into "7 here (= 3 + 3 + 1)"]

adapted from MacKay (2003) textbook

slide-12
SLIDE 12

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree.

[Figure: the tree again; another soldier combines "7 here" and "3 here" with himself into "11 here (= 7 + 3 + 1)"]

adapted from MacKay (2003) textbook

slide-13
SLIDE 13

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree.

[Figure: a soldier combines reports "7 here", "3 here", and "3 here" with himself. Belief: Must be 14 of us (= 7 + 3 + 3 + 1)]

adapted from MacKay (2003) textbook

slide-14
SLIDE 14

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree.

[Figure: the same tree. Belief: Must be 14 of us]

This wouldn't work correctly with a "loopy" (cyclic) graph.

adapted from MacKay (2003) textbook

slide-15
SLIDE 15

SUM-PRODUCT BELIEF PROPAGATION

Exact marginal inference for factor trees


slide-16
SLIDE 16

Message Passing in Belief Propagation

[Figure: a variable X and a factor Ψ exchanging messages along a chain]

"My other factors think I'm a noun." "But my other variables and I think you're a verb."

Messages over tags (v, n, a): (1, 6, 3) and (6, 1, 3); their pointwise product is (6, 6, 9).

Both of these messages judge the possible values of variable X. Their product = belief at X = product of all 3 messages to X.

slide-17
SLIDE 17

Sum-Product Belief Propagation

The four quantities of sum-product BP, for variables i and factors α (N(·) denotes neighbors in the factor graph):

Variable belief:   b_i(x_i) = ∏_{α ∈ N(i)} μ_{α→i}(x_i)
Variable message:  μ_{i→α}(x_i) = ∏_{β ∈ N(i) \ {α}} μ_{β→i}(x_i)
Factor belief:     b_α(x_α) = ψ_α(x_α) · ∏_{i ∈ N(α)} μ_{i→α}(x_i)
Factor message:    μ_{α→i}(x_i) = Σ_{x_α : x_α[i] = x_i} ψ_α(x_α) · ∏_{j ∈ N(α) \ {i}} μ_{j→α}(x_j)

[Figure: example graphs illustrating each quantity]

slide-18
SLIDE 18

Sum-Product Belief Propagation: Variable Belief

[Figure: variable X1 with neighboring factors ψ1, ψ2, ψ3]

Incoming messages to X1, over tags (v, n, p): (0.1, 3, 1), (1, 2, 2), and (4, 1, …).

Belief at X1 = pointwise product of the incoming messages: (0.4, 6, …).

slide-19
SLIDE 19

Sum-Product Belief Propagation: Variable Message

[Figure: X1 sends a message to one of its neighboring factors]

Incoming messages to X1 from its other factors, over (v, n, p): (0.1, 3, 1) and (1, 2, 2).

Outgoing message from X1 = pointwise product of those messages: (0.1, 6, 2).
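A numeric check of the two slides above, as a minimal numpy sketch. Which factor sent which message is not recoverable from the transcript, so the labels below are illustrative, and the p-entry of the third message (illegible above) is a made-up placeholder:

```python
import numpy as np

# Messages over tags (v, n, p), read off the slides where legible.
msg_a = np.array([0.1, 3.0, 1.0])   # one incoming message to X1
msg_b = np.array([1.0, 2.0, 2.0])   # another incoming message to X1
msg_c = np.array([4.0, 1.0, 2.0])   # third message; p-entry 2.0 is assumed

# Variable belief: pointwise product of ALL incoming messages.
belief = msg_a * msg_b * msg_c
print(belief)       # [0.4, 6.0, ...]: v and n entries match the slide

# Variable-to-factor message: product of all incoming messages EXCEPT
# the one from the recipient factor (here, the factor that sent msg_c).
msg_out = msg_a * msg_b
print(msg_out)      # [0.1, 6.0, 2.0]: matches the slide
```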

slide-20
SLIDE 20

Sum-Product Belief Propagation: Factor Belief

[Figure: factor ψ1 joining X1 (values v, n) and X3 (values p, d, n)]

Incoming messages: μ_{X1→ψ1} = (v 8, n 0.2); μ_{X3→ψ1} = (p 4, d 1, n …).

Potential table ψ1(x1, x3), rows x3 ∈ (p, d, n), columns x1 ∈ (v, n):
  p: (0.1, 8)
  d: (3, …)
  n: (1, 1)

Factor belief b(x1, x3) = ψ1(x1, x3) · μ_{X1→ψ1}(x1) · μ_{X3→ψ1}(x3), e.g.:
  b(v, p) = 0.1 · 8 · 4 = 3.2
  b(n, p) = 8 · 0.2 · 4 = 6.4
  b(v, d) = 3 · 8 · 1 = 24

slide-21
SLIDE 21

Sum-Product Belief Propagation: Factor Belief

[Figure: the resulting belief table at ψ1: b(v, p) = 3.2, b(n, p) = 6.4, b(v, d) = 24; the remaining entries are not legible]

slide-22
SLIDE 22

Sum-Product Belief Propagation: Factor Message

[Figure: ψ1 sends a message to X3]

With μ_{X1→ψ1} = (v 8, n 0.2), the message from ψ1 to X3 sums out X1:
  p: 0.1 · 8 + 8 · 0.2 = 0.8 + 1.6
  d: 3 · 8 + 0 · 0.2 = 24 + 0
  n: 1 · 8 + 1 · 0.2 = 8 + 0.2

slide-23
SLIDE 23

Sum-Product Belief Propagation: Factor Message

[Figure: ψ1 between X1 and X3]

μ_{ψ1→X3}(x3) = Σ_{x1} ψ1(x1, x3) · μ_{X1→ψ1}(x1)

This is a matrix-vector product (for a binary factor).
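The factor-belief and factor-message slides above, reproduced as a hedged numpy sketch. ψ1(n, d) = 0 is inferred from the "24 + 0" term on the message slide, and the illegible n-entry of X3's incoming message is a made-up placeholder:

```python
import numpy as np

# Potential table ψ1(x1, x3): rows index X1 in (v, n), columns X3 in (p, d, n).
psi1 = np.array([[0.1, 3.0, 1.0],     # x1 = v
                 [8.0, 0.0, 1.0]])    # x1 = n  (ψ1(n, d) = 0 inferred)

msg_x1 = np.array([8.0, 0.2])         # message X1 -> ψ1 over (v, n)
# The n-entry of X3's message is illegible; 1.0 is a made-up placeholder.
msg_x3 = np.array([4.0, 1.0, 1.0])    # message X3 -> ψ1 over (p, d, n)

# Factor belief: ψ times the outer product of all incoming messages.
belief = psi1 * np.outer(msg_x1, msg_x3)
print(belief)     # belief[v, p] = 3.2, belief[n, p] = 6.4, belief[v, d] = 24

# Factor-to-variable message: sum out every other variable.
# For a binary factor this is exactly a matrix-vector product.
msg_to_x3 = msg_x1 @ psi1
print(msg_to_x3)  # [2.4, 24.0, 8.2]
```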

slide-24
SLIDE 24

Sum-Product Belief Propagation

Input: a factor graph with no cycles
Output: exact marginals for each variable and factor

Algorithm (a code sketch follows below):
1. Initialize the messages to the uniform distribution.
2. Choose a root node.
3. Send messages from the leaves to the root.
4. Send messages from the root to the leaves.
5. Compute the beliefs (unnormalized marginals).
6. Normalize the beliefs and return the exact marginals.
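A minimal end-to-end sketch of these six steps on a made-up three-variable tree (X1 - f12 - X2 - f23 - X3, plus a unary factor f2 on X2); all names, potentials, and the hand-rolled schedule are illustrative, not from the lecture:

```python
import numpy as np
from itertools import product

k = 2  # each variable takes k values
rng = np.random.default_rng(1)
factors = {                   # factor name -> (scope, potential table)
    "f12": (("X1", "X2"), rng.uniform(0.5, 2.0, (k, k))),
    "f23": (("X2", "X3"), rng.uniform(0.5, 2.0, (k, k))),
    "f2":  (("X2",),      rng.uniform(0.5, 2.0, (k,))),
}

def factor_to_var(f, target, var_msgs):
    """Multiply factor f by its incoming messages, sum out all but `target`."""
    scope, table = factors[f]
    axes = {v: i for i, v in enumerate(scope)}
    out = np.zeros(k)
    for assign in product(range(k), repeat=len(scope)):
        w = table[assign]
        for v in scope:
            if v != target:
                w *= var_msgs[(v, f)][assign[axes[v]]]
        out[assign[axes[target]]] += w
    return out

# Step 1: initialize variable-to-factor messages to (unnormalized) uniform.
vm = {(v, f): np.ones(k) for f, (scope, _) in factors.items() for v in scope}
fm = {}
# Steps 2-3: choose X2 as root; leaves send messages toward it.
fm[("f12", "X2")] = factor_to_var("f12", "X2", vm)
fm[("f23", "X2")] = factor_to_var("f23", "X2", vm)
fm[("f2", "X2")]  = factors["f2"][1]
# Step 4: root sends messages back toward the leaves.
vm[("X2", "f12")] = fm[("f23", "X2")] * fm[("f2", "X2")]
vm[("X2", "f23")] = fm[("f12", "X2")] * fm[("f2", "X2")]
fm[("f12", "X1")] = factor_to_var("f12", "X1", vm)
fm[("f23", "X3")] = factor_to_var("f23", "X3", vm)
# Steps 5-6: beliefs are products of incoming messages; normalize.
incoming = {"X1": [fm[("f12", "X1")]],
            "X2": [fm[("f12", "X2")], fm[("f23", "X2")], fm[("f2", "X2")]],
            "X3": [fm[("f23", "X3")]]}
for v, msgs in incoming.items():
    b = np.prod(msgs, axis=0)
    print(v, b / b.sum())     # exact marginals
```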


slide-27
SLIDE 27

FORWARD-BACKWARD AS SUM-PRODUCT BP

slide-28
SLIDE 28

CRF Tagging Model

[Figure: chain CRF with variables X1, X2, X3 over the sentence "find preferred tags"; "find" could be verb or noun, "preferred" could be adjective or verb, "tags" could be noun or verb]

slide-29
SLIDE 29

CRF Tagging by Belief Propagation

  • Forward-backward is a message passing algorithm.
  • It's the simplest case of belief propagation.

Forward algorithm = message passing (matrix-vector products)
Backward algorithm = message passing (matrix-vector products)

[Figure: at one position of the chain over "find preferred tags", forward (α) and backward (β) messages such as (v 2, n 1, a 7), (v 3, n 1, a 6), and (v 7, n 2, a 1) combine into beliefs such as (v 1.8, n 0, a 4.2); the transition potential over tags (v, n, a) is ψ(v, ·) = (0, 2, 1), ψ(n, ·) = (2, 1, 0), ψ(a, ·) = (0, 3, 1)]
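A sketch of forward-backward as matrix-vector message passing on the three-word chain. The transition table is the one shown on the slide; the unary potentials are made up (and START/END factors are folded into the boundary conditions), since they are not legible in the transcript:

```python
import numpy as np

tags = ["v", "n", "a"]
trans = np.array([[0.0, 2.0, 1.0],    # ψ(v, ·)   from the slide
                  [2.0, 1.0, 0.0],    # ψ(n, ·)
                  [0.0, 3.0, 1.0]])   # ψ(a, ·)
unary = np.array([[3.0, 1.0, 2.0],    # ψ_1(·) for "find"      (assumed)
                  [1.0, 4.0, 2.0],    # ψ_2(·) for "preferred" (assumed)
                  [2.0, 5.0, 1.0]])   # ψ_3(·) for "tags"      (assumed)
n = len(unary)

# Forward pass: alpha[i] = total weight of path prefixes ending at each tag.
alpha = np.zeros((n, 3))
alpha[0] = 1.0
for i in range(1, n):
    alpha[i] = (alpha[i - 1] * unary[i - 1]) @ trans

# Backward pass: beta[i] = total weight of path suffixes from each tag.
beta = np.zeros((n, 3))
beta[-1] = 1.0
for i in range(n - 2, -1, -1):
    beta[i] = trans @ (unary[i + 1] * beta[i + 1])

# Belief at position i: alpha * unary * beta; Z is the same at every i.
belief = alpha * unary * beta
Z = belief[0].sum()
print(belief / Z)   # each row sums to 1: the marginals p(X_i = tag)
```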

slide-30
SLIDE 30

So Let's Review Forward-Backward …

[Figure: the chain CRF over "find preferred tags" again, with the same tag ambiguities]

slide-31
SLIDE 31

So Let's Review Forward-Backward …

  • Show the possible values for each variable.

[Figure: lattice with a column of tags (v, n, a) for each of X1, X2, X3 over "find preferred tags", plus START and END states]

slide-32
SLIDE 32

So Let's Review Forward-Backward …

  • Let's show the possible values for each variable.
  • One possible assignment.

[Figure: the same lattice, with one path (one value per variable) highlighted]

slide-33
SLIDE 33

So Let's Review Forward-Backward …

  • Let's show the possible values for each variable.
  • One possible assignment.
  • And what the 7 factors think of it …

[Figure: the same lattice; the 7 factors along the highlighted path are shown]

slide-34
SLIDE 34

Viterbi Algorithm: Most Probable Assignment

[Figure: the lattice, with the path START → v → a → n → END highlighted]

  • So p(v a n) = (1/Z) * product of 7 numbers
  • Numbers associated with edges and nodes of the path
  • Most probable assignment = path with highest product

p(v a n) = (1/Z) · ψ{0,1}(START, v) · ψ{1}(v) · ψ{1,2}(v, a) · ψ{2}(a) · ψ{2,3}(a, n) · ψ{3}(n) · ψ{3,4}(n, END)

slide-35
SLIDE 35

Viterbi Algorithm: Most Probable Assignment

[Figure: the same lattice and path]

  • So p(v a n) = (1/Z) * product weight of one path

p(v a n) = (1/Z) · ψ{0,1}(START, v) · ψ{1}(v) · ψ{1,2}(v, a) · ψ{2}(a) · ψ{2,3}(a, n) · ψ{3}(n) · ψ{3,4}(n, END)

slide-36
SLIDE 36

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice over "find preferred tags"]

  • So p(v a n) = (1/Z) * product weight of one path
  • Marginal probability p(X2 = a) = (1/Z) * total weight of all paths through a

slide-37
SLIDE 37

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

  • So p(v a n) = (1/Z) * product weight of one path
  • Marginal probability p(X2 = n) = (1/Z) * total weight of all paths through n

slide-38
SLIDE 38

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

  • So p(v a n) = (1/Z) * product weight of one path
  • Marginal probability p(X2 = v) = (1/Z) * total weight of all paths through v

slide-39
SLIDE 39

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

  • So p(v a n) = (1/Z) * product weight of one path
  • Marginal probability p(X2 = n) = (1/Z) * total weight of all paths through n

slide-40
SLIDE 40

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

α2(n) = total weight of these path prefixes (found by dynamic programming: matrix-vector products)

slide-41
SLIDE 41

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

β2(n) = total weight of these path suffixes (found by dynamic programming: matrix-vector products)

slide-42
SLIDE 42

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes

(a + b + c)(x + y + z) = ax + ay + az + bx + by + bz + cx + cy + cz

The product of the two sums gives the total weight of all paths.

slide-43
SLIDE 43

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

total weight of all paths through n = α2(n) × ψ{2}(n) × β2(n) = "belief that X2 = n"

Oops! The weight of a path through a state also includes a weight at that state. So α2(n) · β2(n) isn't enough. The extra weight is the opinion of the unigram factor at this variable.

slide-44
SLIDE 44

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

total weight of all paths through n = α2(n) × ψ{2}(n) × β2(n) = "belief that X2 = n"
total weight of all paths through v = α2(v) × ψ{2}(v) × β2(v) = "belief that X2 = v"

slide-45
SLIDE 45

Forward-Backward Algorithm: Finds Marginals

[Figure: the lattice]

total weight of all paths through a = α2(a) × ψ{2}(a) × β2(a) = "belief that X2 = a"

Collecting the beliefs for X2: (v 1.8, n 0, a 4.2). Their sum = Z (the total probability of all paths) = 6. Divide by Z = 6 to get the marginal probabilities: (v 0.3, n 0, a 0.7).

slide-46
SLIDE 46

BP AS DYNAMIC PROGRAMMING


slide-47
SLIDE 47

(Acyclic) Belief Propagation

In a factor graph with no cycles:
  1. Pick any node to serve as the root.
  2. Send messages from the leaves to the root.
  3. Send messages from the root to the leaves.

A node computes an outgoing message along an edge only after it has received incoming messages along all its other edges.

[Figure: factor graph for the sentence "time flies like an arrow", with variables X1–X9 and factors ψ1–ψ13]

slide-49
SLIDE 49

Acyclic BP as Dynamic Programming

[Figure: the factor graph for "time flies like an arrow", partitioned into subgraphs F, G, and H around a variable Xi; adapted from Burkett & Klein (2012)]

Subproblem: Inference using just the factors in subgraph H.

slide-50
SLIDE 50

Acyclic BP as Dynamic Programming

[Figure: subgraph H only]

Subproblem: Inference using just the factors in subgraph H. The marginal of Xi in that smaller model is the message sent to Xi from subgraph H.

Message to a variable.

slide-51
SLIDE 51

Acyclic BP as Dynamic Programming

[Figure: subgraph G only]

Subproblem: Inference using just the factors in subgraph G. The marginal of Xi in that smaller model is the message sent to Xi from subgraph G.

Message to a variable.

slide-52
SLIDE 52

Acyclic BP as Dynamic Programming

[Figure: subgraph F only]

Subproblem: Inference using just the factors in subgraph F. The marginal of Xi in that smaller model is the message sent to Xi from subgraph F.

Message to a variable.

slide-53
SLIDE 53

Acyclic BP as Dynamic Programming

[Figure: subgraphs F and H together]

Subproblem: Inference using just the factors in subgraph F ∪ H. The marginal of Xi in that smaller model is the message sent by Xi out of subgraph F ∪ H.

Message from a variable.

slide-54
SLIDE 54
Acyclic BP as Dynamic Programming

  • If you want the marginal p_i(x_i) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.
  • Each subgraph is obtained by cutting some edge of the tree.
  • The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

[Figure: the full factor graph for "time flies like an arrow"]


slide-61
SLIDE 61

MAX-PRODUCT BELIEF PROPAGATION

Exact MAP inference for factor trees


slide-62
SLIDE 62

Max-product Belief Propagation

  • Sum-product BP can be used to:
    – compute the marginals, p_i(X_i)
    – compute the partition function, Z
  • Max-product BP can be used to:
    – compute the most likely assignment, X* = argmax_X p(X)

slide-63
SLIDE 63

Max-product Belief Propagation

  • Change the sum to a max:

    μ_{α→i}(x_i) = max_{x_α : x_α[i] = x_i} ψ_α(x_α) · ∏_{j ∈ N(α) \ {i}} μ_{j→α}(x_j)

  • Max-product BP computes max-marginals:
    – The max-marginal b_i(x_i) is the (unnormalized) probability of the MAP assignment under the constraint X_i = x_i.
    – For an acyclic graph, the MAP assignment (assuming there are no ties) is given by x_i* = argmax_{x_i} b_i(x_i).
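A sketch of max-product on the same chain as the forward-backward example above (same assumed potentials), keeping backpointers so the MAP assignment can be read off:

```python
import numpy as np

trans = np.array([[0.0, 2.0, 1.0],    # transition table from the slide
                  [2.0, 1.0, 0.0],
                  [0.0, 3.0, 1.0]])
unary = np.array([[3.0, 1.0, 2.0],    # unary potentials (assumed)
                  [1.0, 4.0, 2.0],
                  [2.0, 5.0, 1.0]])
n, k = unary.shape

# viterbi[i][t] = max weight of any prefix ending with tag t at position i.
viterbi = np.zeros((n, k))
backptr = np.zeros((n, k), dtype=int)
viterbi[0] = unary[0]
for i in range(1, n):
    scores = viterbi[i - 1][:, None] * trans   # k x k: (prev tag, tag)
    backptr[i] = scores.argmax(axis=0)
    viterbi[i] = scores.max(axis=0) * unary[i]

# Follow backpointers from the best final tag to recover the MAP path.
path = [int(viterbi[-1].argmax())]
for i in range(n - 1, 0, -1):
    path.append(int(backptr[i][path[-1]]))
path.reverse()
print(path, viterbi[-1].max())   # MAP tag sequence and its weight
```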


slide-65
SLIDE 65

Deterministic Annealing

Motivation: smoothly transition from sum-product to max-product.

1. Incorporate an inverse temperature parameter into each factor: ψ_α(x_α) → ψ_α(x_α)^{1/T}
2. Send messages as usual for sum-product BP.
3. Anneal T from 1 to 0.
4. Take the resulting beliefs to the power T.

Annealed joint distribution: p_T(x) ∝ ∏_α ψ_α(x_α)^{1/T}
T = 1: sum-product. T → 0: max-product.
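A tiny illustration of the schedule on one made-up potential: raise it to the power 1/T, normalize (the sum-product step), then take the result to the power T. At T = 1 this is the ordinary belief; as T → 0 it approaches the max-normalized potential, i.e. max-product behavior:

```python
import numpy as np

psi = np.array([1.0, 2.0, 4.0])       # a single made-up potential

for T in [1.0, 0.5, 0.1, 0.01]:
    b = psi ** (1.0 / T)              # step 1: annealed potential
    b = b / b.sum()                   # sum-product belief at temperature T
    print(T, np.round(b ** T, 3))     # step 4: beliefs to the power T
# As T -> 0 this converges to psi / max(psi) = [0.25, 0.5, 1.0],
# which peaks exactly at the argmax, as max-product would.
```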

slide-66
SLIDE 66

Semirings

  • Sum-product (+, *) and max-product (max, *) are commutative semirings.
  • We can run BP with any such commutative semiring.
  • In practice, multiplying many small numbers together can yield underflow:
    – instead of (+, *), we use (log-add, +)
    – instead of (max, *), we use (max, +)
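The factor-to-variable message from the earlier example, recomputed in the (log-add, +) semiring; a tiny constant stands in for the zero potential entry, since log 0 = -∞:

```python
import numpy as np
from scipy.special import logsumexp

log_psi = np.log(np.array([[0.1, 3.0, 1.0],
                           [8.0, 1e-300, 1.0]]))  # tiny stand-in for 0
log_msg = np.log(np.array([8.0, 0.2]))

# Real domain:  mu(x3) = sum_{x1} psi(x1, x3) * msg(x1)
# Log domain:   log mu(x3) = logsumexp_{x1} [log psi(x1, x3) + log msg(x1)]
log_mu = logsumexp(log_psi + log_msg[:, None], axis=0)
print(np.exp(log_mu))    # [2.4, 24.0, 8.2]: matches the earlier message

# Max-product in log space uses the (max, +) semiring instead:
print((log_psi + log_msg[:, None]).max(axis=0))
```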

slide-67
SLIDE 67

FORWARD-BACKWARD AND VITERBI ALGORITHMS

Exact inference for linear chain models


slide-68
SLIDE 68

Forward-Backward Algorithm

  • Sum-product BP on an HMM is called the forward-backward algorithm.
  • Max-product BP on an HMM is called the Viterbi algorithm.

slide-69
SLIDE 69

Forward-Backward Algorithm

Trigram HMM is not a tree, even when converted to a factor graph.

[Figure: trigram HMM over "time flies like an arrow" with tag variables X1–X5 and word variables W1–W5]

slide-70
SLIDE 70

Forward-Backward Algorithm

Trigram HMM is not a tree, even when converted to a factor graph.

[Figure: the same sentence as a factor graph, with factors ψ1–ψ12 linking adjacent and skip-adjacent tag variables]

slide-71
SLIDE 71

Forward-Backward Algorithm

Trigram HMM is not a tree, even when converted to a factor graph.

[Figure: the factor graph again]

Trick (see also Sha & Pereira (2003)); a code sketch follows below:

  • Replace each variable domain with its cross product, e.g. {B, I, O} → {BB, BI, BO, IB, II, IO, OB, OI, OO}
  • Replace each pair of variables with a single one: for all i, y_{i,i+1} = (x_i, x_{i+1})
  • Add features with weight -∞ that disallow illegal configurations between pairs of the new variables, e.g. the adjacent pair (BI, IO) is legal (both agree that the shared tag is I), while (II, OO) is illegal.
  • This is effectively a special case of the junction tree algorithm.
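A minimal sketch of the cross-product construction for BIO tags; the pair_potential helper and the uniform trigram scorer are made up for illustration:

```python
from itertools import product

# Pair up adjacent variables so trigram factors become pairwise.
tags = ["B", "I", "O"]
pair_domain = ["".join(p) for p in product(tags, tags)]
print(pair_domain)  # ['BB', 'BI', 'BO', 'IB', 'II', 'IO', 'OB', 'OI', 'OO']

# A pairwise potential over (y_{i,i+1}, y_{i+1,i+2}) is legal only when the
# two pair-variables agree on the shared tag x_{i+1}; illegal configurations
# get weight 0 (log-weight -inf).
def pair_potential(left, right, trigram_score):
    if left[1] != right[0]:            # e.g. ("II", "OO") disagree on x_{i+1}
        return 0.0
    return trigram_score(left[0], left[1], right[1])

uniform = lambda a, b, c: 1.0          # made-up trigram scorer
print(pair_potential("BI", "IO", uniform))   # 1.0: legal
print(pair_potential("II", "OO", uniform))   # 0.0: illegal
```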
slide-72
SLIDE 72

Summary

1. Factor Graphs
   – Alternative representation of directed / undirected graphical models
   – Make the cliques of an undirected GM explicit
2. Variable Elimination
   – Simple and general approach to exact inference
   – Just a matter of being clever when computing sum-products
3. Sum-product Belief Propagation
   – Computes all the marginals and the partition function in only twice the work of variable elimination
4. Max-product Belief Propagation
   – Identical to sum-product BP, but changes the semiring
   – Computes: max-marginals, the probability of the MAP assignment, and (with backpointers) the MAP assignment itself