
slide-1
SLIDE 1

Natural Language Processing (CSE 517): Graphical Models

Noah Smith

© 2016 University of Washington, nasmith@cs.washington.edu

February 8–10, 2016

1 / 77

slide-2
SLIDE 2

Notation

Let V = {V1, V2, . . . , Vℓ} be a collection of random variables (not necessarily a sequence). Val(V) denotes the set of values a r.v. V can take. VI denotes the subset of the r.v.s V with indices i ∈ I, and V¬I = V \ VI. Recall:

◮ p(V) = ∏_{i=1}^{ℓ} p(Vi | V1, . . . , Vi−1) (always true, for any ordering)

◮ p(VI, VJ | VK) = p(VI | VK) · p(VJ | VK) if and only if VI ⊥ VJ | VK (conditional independence)

◮ p(VI = vI) = ∑_{v¬I ∈ Val(V¬I)} p(VI = vI, V¬I = v¬I) (marginalization)

2 / 77

slide-3
SLIDE 3

Factor Graphs

Two kinds of vertices:

◮ Random variables (denoted by circles, “Vi”)
◮ Factors (denoted by squares, “fj”)

The graph is bipartite; every edge connects some variable to some factor. Let Ij ⊆ {1, . . . , ℓ} be the set of indices of the variables fj is connected to. Factor fj defines a map Val(VIj) → R≥0. The graph and factors define a probability distribution:

p(V = v) ∝ ∏_j fj(vIj)

3 / 77

slide-4
SLIDE 4

Factor Graphs We’ve Seen Before

Hidden Markov model:

[diagram: chain-structured factor graph over states y0, y1, . . . , y5 and observations x1, . . . , x4]

General first-order sequence model:

[diagram: factor graph over y0, . . . , y5 in which the entire observation x connects to every factor]

4 / 77

slide-5
SLIDE 5

Two Kinds of Factors

◮ Conditional probability tables. E.g., if Ij = {1, 2, 3}: fj(v1, v2, v3) = p(V3 = v3 | V1 = v1, V2 = v2). These lead to Bayesian networks (with some constraints).

◮ Potential functions (arbitrary nonnegative values). These lead to Markov random fields (a.k.a. Markov networks).

5 / 77

slide-6
SLIDE 6

Yucky Bayesian Network

[diagram: Bayesian network over Influenza, Allergies, Sinus Inflamm., Runny Nose, Headache]

Sinus inflammation is caused by flu, but also by allergies. Runny nose and headache are both caused by sinus inflammation.

6 / 77

slide-7
SLIDE 7

Yucky Factor Graph

[diagram: factor graph over Influenza, Allergies, Sinus Inflamm., Runny Nose, Headache]

Sinus inflammation is caused by flu, but also by allergies. Runny nose and headache are both caused by sinus inflammation.

7 / 77

slide-8
SLIDE 8

Yucky Factor Graph

[diagram: factor graph over Influenza, Allergies, Sinus Inflamm., Runny Nose, Headache, with factor tables fI(I), fA(A), fS,I,A(S, I, A), fR,S(R, S), fH,S(H, S)]

8 / 77

slide-9
SLIDE 9

Yucky Factor Graph

[diagrams: the Bayesian network and the factor graph over Influenza, Allergies, Sinus Inflamm., Runny Nose, Headache, side by side, with factor tables fI, fA, fS,I,A, fR,S, fH,S]

p(i, a, s, r, h) = fI(i) · fA(a) · fS,I,A(s, i, a) · fR,S(r, s) · fH,S(h, s)
                 = p(i) · p(a) · p(s | i, a) · p(r | s) · p(h | s)

9 / 77

slide-10
SLIDE 10

Naughty Markov Random Field

[diagram: Markov network forming a cycle over Adrian, Brook, Chris, Dana]

Independencies: A⊥C | {B, D} and B⊥D | {A, C}, but marginally A ̸⊥ C and B ̸⊥ D.

10 / 77

slide-11
SLIDE 11

Naughty Factor Graph

[diagram: cycle over Adrian, Brook, Chris, Dana with pairwise factors fA,B, fB,C, fC,D, fD,A]

p(a, b, c, d) = fA,B(a, b) · fB,C(b, c) · fC,D(c, d) · fD,A(d, a) / Z, where

Z = ∑_{a′∈Val(A)} ∑_{b′∈Val(B)} ∑_{c′∈Val(C)} ∑_{d′∈Val(D)} fA,B(a′, b′) · fB,C(b′, c′) · fC,D(c′, d′) · fD,A(d′, a′)

11 / 77

slide-12
SLIDE 12

Assignment Probabilities: Examples

[diagram: cycle over Adrian, Brook, Chris, Dana]

A B | fA,B    B C | fB,C    C D | fC,D    D A | fD,A
0 0 |   30    0 0 |  100    0 0 |    1    0 0 |  100
0 1 |    5    0 1 |    1    0 1 |  100    0 1 |    1
1 0 |    1    1 0 |    1    1 0 |  100    1 0 |    1
1 1 |   10    1 1 |  100    1 1 |    1    1 1 |  100

12 / 77

slide-13
SLIDE 13

Assignment Probabilities: Examples

[factor graph and tables as on slide 12]

Z = ∑_{a′∈Val(A)} ∑_{b′∈Val(B)} ∑_{c′∈Val(C)} ∑_{d′∈Val(D)} fA,B(a′, b′) · fB,C(b′, c′) · fC,D(c′, d′) · fD,A(d′, a′) = 7,201,840

13 / 77

slide-14
SLIDE 14

Assignment Probabilities: Examples

[factor graph and tables as on slide 12]

p(A = 0, B = 1, C = 1, D = 0) = 5,000,000 / 7,201,840 ≈ 0.69

14 / 77

slide-15
SLIDE 15

Assignment Probabilities: Examples

[factor graph and tables as on slide 12]

p(A = 1, B = 1, C = 0, D = 0) = 10 / 7,201,840 ≈ 0.0000014

15 / 77
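The normalization constant and the two probabilities above can be checked by brute-force enumeration. A minimal sketch, assuming binary variables encoded as 0/1 and the factor tables as read off slides 12–15 (the function and variable names are mine):

```python
from itertools import product

# Pairwise factor tables as read off the slides (binary variables, values 0/1).
f_AB = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
f_BC = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
f_CD = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}
f_DA = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}

def score(a, b, c, d):
    """Unnormalized probability: the product of the four factors."""
    return f_AB[a, b] * f_BC[b, c] * f_CD[c, d] * f_DA[d, a]

# Z sums the factor product over all 2^4 assignments.
Z = sum(score(*v) for v in product((0, 1), repeat=4))
print(Z)                      # 7201840
print(score(0, 1, 1, 0) / Z)  # ~0.69
print(score(1, 1, 0, 0) / Z)  # ~0.0000014
```

Sixteen assignments is nothing; the point of the rest of the lecture is that this enumeration is exponential in the number of variables.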

slide-16
SLIDE 16

Structure and Independence

Bayesian networks:

◮ A variable is conditionally independent of its non-descendants given its parents.

Markov networks:

◮ Conditional independence is derived from “Markov blanket” and separation properties.

Local configurations can be used to answer all conditional independence questions; there is almost no need to look at the values in the factors!

16 / 77

slide-17
SLIDE 17

Independence “Spectrum”

∏_{i=1}^{ℓ} fVi(Vi)  ←——————————→  fV (V)

everything is independent ......... everything can be interdependent
minimal expressive power .......... arbitrary expressive power
fewer parameters .................. more parameters

17 / 77

slide-18
SLIDE 18

Operations on Factors: Multiplication

Given two factors fU and fV , we can create a new “product” factor such that: fU∪V (u ∪ v) = fU(u) · fV (v) for all u ∈ Val(U) and all v ∈ Val(V ).

A B | fA,B        B C | fB,C        A B C | fA,B,C
0 0 |   30        0 0 |  100        0 0 0 |  3,000
0 1 |    5   ·    0 1 |    1   =    0 0 1 |     30
1 0 |    1        1 0 |    1        0 1 0 |      5
1 1 |   10        1 1 |  100        0 1 1 |    500
                                    1 0 0 |    100
                                    1 0 1 |      1
                                    1 1 0 |     10
                                    1 1 1 |  1,000

18 / 77

slide-19
SLIDE 19

Operations on Factors: Multiplication

Given two factors fU and fV , we can create a new “product” factor such that: fU∪V (u ∪ v) = fU(u) · fV (v) for all u ∈ Val(U) and all v ∈ Val(V ).

[factor tables and product as on slide 18]

This might remind you of a join operation on a database.

19 / 77

slide-20
SLIDE 20

Operations on Factors: Multiplication

Given two factors fU and fV , we can create a new “product” factor such that: fU∪V (u ∪ v) = fU(u) · fV (v) for all u ∈ Val(U) and all v ∈ Val(V ).

[factor tables and product as on slide 18]

What happens if you multiply out all the factors in a factor graph?

20 / 77
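The product operation can be sketched as a small dictionary join, much like the database join mentioned above. The `factor_product` helper and the 0/1 value encoding are my assumptions, not from the slides:

```python
from itertools import product

def factor_product(f1, scope1, f2, scope2):
    """Multiply two factors (dicts keyed by value tuples) over the union scope."""
    scope = scope1 + [v for v in scope2 if v not in scope1]
    out = {}
    for vals in product((0, 1), repeat=len(scope)):
        assign = dict(zip(scope, vals))
        k1 = tuple(assign[v] for v in scope1)
        k2 = tuple(assign[v] for v in scope2)
        out[vals] = f1[k1] * f2[k2]  # shared variables must agree, like a join key
    return out, scope

f_AB = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
f_BC = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
f_ABC, scope = factor_product(f_AB, ["A", "B"], f_BC, ["B", "C"])
print(scope)           # ['A', 'B', 'C']
print(f_ABC[0, 0, 0])  # 3000
print(f_ABC[1, 1, 1])  # 1000
```

Note the size blowup: two 4-entry tables produce an 8-entry table, which is exactly why multiplying out all factors at once is a bad idea.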

slide-21
SLIDE 21

Operations on Factors: Maximization

Given a factor fU,V and a variable V, we can transform fU,V into fU by:

fU(u) = max_{v∈Val(V)} fU,V (u, v)

for all u ∈ Val(U).

A C | fA,C                        A B C | fA,B,C
0 0 | 3,000 (B = 0)               0 0 0 |  3,000
0 1 |   500 (B = 1)               0 0 1 |     30
1 0 |   100 (B = 0)   =  max_B    0 1 0 |      5
1 1 | 1,000 (B = 1)               0 1 1 |    500
                                  1 0 0 |    100
                                  1 0 1 |      1
                                  1 1 0 |     10
                                  1 1 1 |  1,000

21 / 77

slide-22
SLIDE 22

Operations on Factors: Marginalization

Given a factor fU,V and a variable V, we can transform fU,V into fU by:

fU(u) = ∑_{v∈Val(V)} fU,V (u, v)

for all u ∈ Val(U).

A C | fA,C                        A B C | fA,B,C
0 0 | 3,000 + 5 = 3,005           0 0 0 |  3,000
0 1 |  30 + 500 =   530           0 0 1 |     30
1 0 |  100 + 10 =   110   = ∑_B   0 1 0 |      5
1 1 | 1 + 1,000 = 1,001           0 1 1 |    500
                                  1 0 0 |    100
                                  1 0 1 |      1
                                  1 1 0 |     10
                                  1 1 1 |  1,000

22 / 77

slide-23
SLIDE 23

Operations on Factors: Marginalization

Given a factor fU,V and a variable V, we can transform fU,V into fU by:

fU(u) = ∑_{v∈Val(V)} fU,V (u, v)

for all u ∈ Val(U).

[factor tables: ∑_B fA,B,C, as on slide 22]

If you multiply out all the factors in a factor graph, then sum out each variable, one by one, until none are left, what do you get?

23 / 77
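Maximization and marginalization are the same loop with a different reduction. A sketch (the `eliminate` helper and dict encoding are mine) that reproduces the fA,C tables from slides 21–22:

```python
def eliminate(f, scope, var, op=sum):
    """Remove `var` from factor f (a dict keyed by value tuples) by
    applying `op` over its values: sum -> marginalization, max -> maximization."""
    i = scope.index(var)
    grouped = {}
    for vals, x in f.items():
        key = vals[:i] + vals[i + 1:]          # drop var's coordinate
        grouped.setdefault(key, []).append(x)
    return {k: op(v) for k, v in grouped.items()}, scope[:i] + scope[i + 1:]

# f_{A,B,C} as tabulated on slide 18.
f_ABC = {(0, 0, 0): 3000, (0, 0, 1): 30, (0, 1, 0): 5, (0, 1, 1): 500,
         (1, 0, 0): 100, (1, 0, 1): 1, (1, 1, 0): 10, (1, 1, 1): 1000}

f_sum, _ = eliminate(f_ABC, ["A", "B", "C"], "B", op=sum)
f_max, _ = eliminate(f_ABC, ["A", "B", "C"], "B", op=max)
print(f_sum)  # {(0, 0): 3005, (0, 1): 530, (1, 0): 110, (1, 1): 1001}
print(f_max)  # {(0, 0): 3000, (0, 1): 500, (1, 0): 100, (1, 1): 1000}
```

Passing the reduction as `op` foreshadows the sum-vs-max symmetry exploited later for marginal vs. MPE inference.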

slide-24
SLIDE 24

Factors are like numbers.

◮ Products are commutative: f1 · f2 = f2 · f1

24 / 77

slide-25
SLIDE 25

Factors are like numbers.

◮ Products are commutative: f1 · f2 = f2 · f1 ◮ Products are associative: (f1 · f2) · f3 = f1 · (f2 · f3)

25 / 77

slide-26
SLIDE 26

Factors are like numbers.

◮ Products are commutative: f1 · f2 = f2 · f1
◮ Products are associative: (f1 · f2) · f3 = f1 · (f2 · f3)
◮ Sums are commutative: ∑_X ∑_Y f = ∑_Y ∑_X f
26 / 77

slide-27
SLIDE 27

Factors are like numbers.

◮ Products are commutative: f1 · f2 = f2 · f1
◮ Products are associative: (f1 · f2) · f3 = f1 · (f2 · f3)
◮ Sums are commutative: ∑_X ∑_Y f = ∑_Y ∑_X f
◮ Maximizations are commutative: max_X max_Y f = max_Y max_X f

27 / 77

slide-28
SLIDE 28

Factors are like numbers.

◮ Products are commutative: f1 · f2 = f2 · f1
◮ Products are associative: (f1 · f2) · f3 = f1 · (f2 · f3)
◮ Sums are commutative: ∑_X ∑_Y f = ∑_Y ∑_X f
◮ Maximizations are commutative: max_X max_Y f = max_Y max_X f
◮ Multiplication distributes over marginalization and maximization:

∑_X (f1 · f2) = f1 · ∑_X f2          max_X (f1 · f2) = f1 · max_X f2

(assuming X is not in the scope of f1).

28 / 77

slide-29
SLIDE 29

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

29 / 77

slide-30
SLIDE 30

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

◮ Marginal inference: what is the marginal distribution over

Q ⊂ U? (p(Q | o), marginalizing out the rest.)

30 / 77

slide-31
SLIDE 31

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

◮ Marginal inference: what is the marginal distribution over

Q ⊂ U? (p(Q | o), marginalizing out the rest.)

◮ Related: draw samples from that distribution.

31 / 77

slide-32
SLIDE 32

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

◮ Marginal inference: what is the marginal distribution over

Q ⊂ U? (p(Q | o), marginalizing out the rest.)

◮ Related: draw samples from that distribution.

◮ Most probable explanation (MPE): what is the most

probable assignment to U? (argmaxu p(u | o))

32 / 77

slide-33
SLIDE 33

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

◮ Marginal inference: what is the marginal distribution over

Q ⊂ U? (p(Q | o), marginalizing out the rest.)

◮ Related: draw samples from that distribution.

◮ Most probable explanation (MPE): what is the most

probable assignment to U? (argmaxu p(u | o))

◮ Related: what is the most dangerous assignment to U?

33 / 77

slide-34
SLIDE 34

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

◮ Marginal inference: what is the marginal distribution over

Q ⊂ U? (p(Q | o), marginalizing out the rest.)

◮ Related: draw samples from that distribution.

◮ Most probable explanation (MPE): what is the most

probable assignment to U? (argmaxu p(u | o))

◮ Related: what is the most dangerous assignment to U?

◮ Maximum a posteriori (MAP): what is the most probable

assignment to Q ⊂ U? (argmaxq p(q | o))

34 / 77

slide-35
SLIDE 35

Inference

Most general definition: “reason about some variables, optionally given values of some others.” Let O be the observed variables and U be the unobserved ones; V = O ∪ U. Three inference problems, all given O = o . . .

◮ Marginal inference: what is the marginal distribution over Q ⊂ U? (p(Q | o), marginalizing out the rest.)
◮ Related: draw samples from that distribution.
◮ Most probable explanation (MPE): what is the most probable assignment to U? (argmax_u p(u | o))
◮ Related: what is the most dangerous assignment to U?
◮ Maximum a posteriori (MAP): what is the most probable assignment to Q ⊂ U? (argmax_q p(q | o))
◮ Related: what values of Q have the lowest expected cost?

35 / 77

slide-36
SLIDE 36

Marginal Inference

Given a factor graph with variables V , find the marginal distribution over some Vi ∈ V , p(Vi). Simple chain example, focusing on i = 4:

[diagram: chain factor graph V1 — V2 — V3 — V4 with factor tables fV1(V1), fV1,V2, fV2,V3, fV3,V4]

36 / 77

slide-37
SLIDE 37

Observations

◮ If we had a single fV4, we could easily renormalize it to get

p(V4).

37 / 77

slide-38
SLIDE 38

Observations

◮ If we had a single fV4, we could easily renormalize it to get

p(V4).

◮ Correct: fV4 = ∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4

38 / 77

slide-39
SLIDE 39

Observations

◮ If we had a single fV4, we could easily renormalize it to get

p(V4).

◮ Correct: fV4 = ∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4

◮ But that multiplied-out factor would have ∏_i |Val(Vi)| values!

39 / 77

slide-40
SLIDE 40

Observations

◮ If we had a single fV4, we could easily renormalize it to get

p(V4).

◮ Correct: fV4 = ∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4

◮ But that multiplied-out factor would have ∏_i |Val(Vi)| values!

◮ Reorganize calculations:

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = ∑_{V3} fV3,V4 · ( ∑_{V2} fV2,V3 · ( ∑_{V1} fV1,V2 · fV1 ) )

40 / 77

slide-41
SLIDE 41

Marginal Inference

[diagram: chain V1 — V2 — V3 — V4 with factor tables fV1, fV1,V2, fV2,V3, fV3,V4]

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = ∑_{V3} fV3,V4 · ( ∑_{V2} fV2,V3 · ( ∑_{V1} fV1,V2 · fV1 ) )

41 / 77

slide-42
SLIDE 42

Marginal Inference

[diagram: chain with V1 eliminated; remaining factor tables fV2 (new), fV2,V3, fV3,V4]

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = ∑_{V3} fV3,V4 · ( ∑_{V2} fV2,V3 · fV2 )

42 / 77

slide-43
SLIDE 43

Marginal Inference

[diagram: chain with V1 and V2 eliminated; remaining factor tables fV3 (new), fV3,V4]

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = ∑_{V3} fV3,V4 · fV3

43 / 77

slide-44
SLIDE 44

Marginal Inference

[diagram: only V4 remains, with the single factor table fV4]

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = fV4

44 / 77

slide-45
SLIDE 45

Variable Elimination

Given a factor graph with factors f, eliminate variable V.

1. Let f_elim ⊆ f be the factors connected to V.
2. Let f_keep = f \ f_elim be the rest.
3. Let fnew = ∑_V ∏_{f∈f_elim} f.
4. Return f_keep ∪ {fnew}.

Uses the graph structure to avoid exponential blowup; this is an example of dynamic programming.

45 / 77
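The four steps above can be sketched directly. The `eliminate_var` helper and the dict-based factor encoding are my own; here it is applied to the four-person cycle from the earlier slides, eliminating everything but A:

```python
from itertools import product as cartesian

def eliminate_var(factors, var):
    """One step of variable elimination: multiply the factors touching `var`,
    sum `var` out of the product, and return the updated factor list."""
    elim = [(f, s) for f, s in factors if var in s]       # step 1: f_elim
    keep = [(f, s) for f, s in factors if var not in s]   # step 2: f_keep
    scope = []
    for _, s in elim:
        scope += [v for v in s if v not in scope]
    new_scope = [v for v in scope if v != var]
    fnew = {}
    for vals in cartesian((0, 1), repeat=len(scope)):     # step 3: fnew
        assign = dict(zip(scope, vals))
        p = 1
        for f, s in elim:
            p *= f[tuple(assign[v] for v in s)]
        key = tuple(assign[v] for v in new_scope)
        fnew[key] = fnew.get(key, 0) + p
    return keep + [(fnew, new_scope)]                     # step 4

# Misconception-style cycle from the earlier slides.
f_AB = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
f_BC = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
f_CD = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}
f_DA = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
factors = [(f_AB, ["A", "B"]), (f_BC, ["B", "C"]),
           (f_CD, ["C", "D"]), (f_DA, ["D", "A"])]

for v in ["B", "C", "D"]:      # eliminate everything but A
    factors = eliminate_var(factors, v)
(fA, scope), = factors
Z = fA[(0,)] + fA[(1,)]
print(Z)                       # 7201840
print(fA[(0,)] / Z)            # p(A = 0)
```

No intermediate factor here ever has more than two variables in scope, which is the whole point: the same Z from slide 13 arrives without touching a 16-row joint table.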

slide-46
SLIDE 46

Marginal Inference by Variable Elimination (No Evidence)

Given a factor graph with variables V and factors f, find the marginal distribution over some V keep ⊂ V .

1. Order the variables in V \ V_keep.
2. For each V ∈ V \ V_keep:
   ◮ Eliminate V; i.e., remove the factors connected to V and replace them with the derived fnew.

The resulting factor graph is proportional to p(V_keep).

46 / 77

slide-47
SLIDE 47

Marginal Inference by Variable Elimination (No Evidence)

Given a factor graph with variables V and factors f, find the marginal distribution over some V keep ⊂ V .

1. Order the variables in V \ V_keep. The ordering can make a huge difference!
2. For each V ∈ V \ V_keep:
   ◮ Eliminate V; i.e., remove the factors connected to V and replace them with the derived fnew.

The resulting factor graph is proportional to p(V_keep).

47 / 77

slide-48
SLIDE 48

A Less Good Ordering

[diagram: chain V1 — V2 — V3 — V4]

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = ∑_{V1} fV1 · ( ∑_{V2} fV1,V2 · ( ∑_{V3} fV2,V3 · fV3,V4 ) )

48 / 77

slide-49
SLIDE 49

A Less Good Ordering

[diagram: chain V1 — V2 — V3 — V4]

∑_{V1} ∑_{V2} ∑_{V3} fV1 · fV1,V2 · fV2,V3 · fV3,V4 = ∑_{V1} fV1 · ( ∑_{V2} fV1,V2 · ( ∑_{V3} fV2,V3 · fV3,V4 ) )
                                                   = ∑_{V1} fV1 · ( ∑_{V2} fV1,V2 · fV2,V4 )

49 / 77

slide-50
SLIDE 50

What About Evidence?

Original problem: given O = o, what is the marginal distribution over Q ⊂ U? (I.e., p(Q | O = o).)

[diagram: nested sets — Q inside U inside V, with O = V \ U observed]

50 / 77

slide-51
SLIDE 51

What About Evidence?

Original problem: given O = o, what is the marginal distribution over Q ⊂ U? (I.e., p(Q | O = o).)

This adds a step at the beginning: reduce factors to “respect the evidence.”

51 / 77

slide-52
SLIDE 52

What About Evidence?

Original problem: given O = o, what is the marginal distribution over Q ⊂ U? (I.e., p(Q | O = o).)

This adds a step at the beginning: reduce factors to “respect the evidence.” This will remind you of a select . . . where operation in a database.

52 / 77

slide-53
SLIDE 53

Marginal Inference

Suppose V1 is observed to take value 1.

[diagram: chain V1 — V2 — V3 — V4 with factor tables; V1 is shaded as observed]

53 / 77

slide-54
SLIDE 54

Marginal Inference

Suppose V1 is observed to take value 1.

[diagram: the same chain, with the factors touching V1 reduced to respect V1 = 1]

54 / 77

slide-55
SLIDE 55

Marginal Inference

Suppose V1 is observed to take value 1.

[diagram: the same chain after reduction; fV1 is now a constant]

Note that fV1 is now a constant; since we renormalize at the end, we can ignore it. Observed nodes may create a “separation” between variables of interest and some factors.

55 / 77

slide-56
SLIDE 56

Marginal Inference by Variable Elimination with Evidence

Given a factor graph with variables V and factors f, and given O = o (where O ⊂ V ), find the marginal distribution over Q ⊆ U = V \ O.

1. Reduce the factors connected to O to respect the evidence.
2. Order the variables in U \ Q.
3. For each V ∈ U \ Q:
   ◮ Eliminate V; i.e., remove the factors connected to V and replace them with the derived fnew.

The resulting factor graph is proportional to p(Q | O = o).

56 / 77

slide-57
SLIDE 57

Remarks on Computational Complexity

In general, denser graphs are more expensive. Runtime and space depend on the size of the original and intermediate factors. (This is why ordering matters so much.) Finding the best ordering is NP-hard. Certain graphical structures allow inference in linear time with respect to the size of the original factors.

◮ Bayesian networks: polytrees
◮ Markov networks: chordal graphs

57 / 77

slide-58
SLIDE 58

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

58 / 77

slide-59
SLIDE 59

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

◮ Given an observed sequence x, however, an HMM provides a

pattern to construct a Bayesian network.

59 / 77

slide-60
SLIDE 60

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

◮ Given an observed sequence x, however, an HMM provides a

pattern to construct a Bayesian network.

◮ Sometimes called “dynamic graphical models.”

60 / 77

slide-61
SLIDE 61

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

◮ Given an observed sequence x, however, an HMM provides a

pattern to construct a Bayesian network.

◮ Sometimes called “dynamic graphical models.”

◮ Marginal inference for every Yi in an HMM can be

accomplished by variable elimination.

61 / 77

slide-62
SLIDE 62

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

◮ Given an observed sequence x, however, an HMM provides a

pattern to construct a Bayesian network.

◮ Sometimes called “dynamic graphical models.”

◮ Marginal inference for every Yi in an HMM can be

accomplished by variable elimination.

◮ All variables share some computation with those to their right

and those to their left.

62 / 77

slide-63
SLIDE 63

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

◮ Given an observed sequence x, however, an HMM provides a

pattern to construct a Bayesian network.

◮ Sometimes called “dynamic graphical models.”

◮ Marginal inference for every Yi in an HMM can be

accomplished by variable elimination.

◮ All variables share some computation with those to their right

and those to their left.

◮ This is called the forward-backward algorithm.

63 / 77

slide-64
SLIDE 64

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.

◮ Given an observed sequence x, however, an HMM provides a

pattern to construct a Bayesian network.

◮ Sometimes called “dynamic graphical models.”

◮ Marginal inference for every Yi in an HMM can be

accomplished by variable elimination.

◮ All variables share some computation with those to their right

and those to their left.

◮ This is called the forward-backward algorithm.
◮ This is useful when we want to apply EM to HMMs (unsupervised sequence modeling).

64 / 77

slide-65
SLIDE 65

Return to Hidden Markov Models

◮ Hidden Markov models are not (quite) Bayesian networks.
◮ Given an observed sequence x, however, an HMM provides a pattern to construct a Bayesian network.
◮ Sometimes called “dynamic graphical models.”
◮ Marginal inference for every Yi in an HMM can be accomplished by variable elimination.
◮ All variables share some computation with those to their right and those to their left.
◮ This is called the forward-backward algorithm.
◮ This is useful when we want to apply EM to HMMs (unsupervised sequence modeling).
◮ It is also useful in supervised learning.

65 / 77
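Forward-backward is exactly left-to-right and right-to-left variable elimination with shared intermediate factors. A sketch on a tiny two-state HMM whose parameters are entirely made up for illustration, cross-checked against brute-force enumeration:

```python
from itertools import product

# A tiny HMM (hypothetical numbers): 2 states, 2 symbols.
pi    = [0.6, 0.4]                # p(y1)
trans = [[0.7, 0.3], [0.2, 0.8]]  # p(y' | y)
emit  = [[0.9, 0.1], [0.3, 0.7]]  # p(x | y)
x = [0, 1, 1, 0]                  # observed sequence
n, K = len(x), 2

# Forward: alpha[i][y] = p(x[:i+1], Y_i = y); elimination left-to-right.
alpha = [[pi[y] * emit[y][x[0]] for y in range(K)]]
for i in range(1, n):
    alpha.append([emit[y][x[i]] * sum(alpha[-1][yp] * trans[yp][y]
                                      for yp in range(K)) for y in range(K)])

# Backward: beta[i][y] = p(x[i+1:] | Y_i = y); elimination right-to-left.
beta = [[1.0] * K for _ in range(n)]
for i in range(n - 2, -1, -1):
    beta[i] = [sum(trans[y][yn] * emit[yn][x[i + 1]] * beta[i + 1][yn]
                   for yn in range(K)) for y in range(K)]

Z = sum(alpha[-1])  # p(x)
posterior = [[alpha[i][y] * beta[i][y] / Z for y in range(K)] for i in range(n)]

# Cross-check against brute-force enumeration of all K^n state sequences.
def joint(ys):
    p = pi[ys[0]] * emit[ys[0]][x[0]]
    for i in range(1, n):
        p *= trans[ys[i - 1]][ys[i]] * emit[ys[i]][x[i]]
    return p

Zb = sum(joint(ys) for ys in product(range(K), repeat=n))
print(abs(Z - Zb) < 1e-12)   # True
```

One forward and one backward sweep yield the posterior marginal p(Yi | x) for every position at once, rather than re-running elimination per query variable.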

slide-66
SLIDE 66

Related Topics

◮ Conditional random fields ◮ MPE inference ◮ MAP inference ◮ Inexact inference

66 / 77

slide-67
SLIDE 67

Conditional Random Fields (Sequence Version)

Lafferty et al. (2001)

A nice confluence:

◮ Probabilistic graphical model-style reasoning, as in HMMs. ◮ Discriminative training, as with structured perceptron.

Local factors: fi(x, y, y′) = exp (w · φ(x, i, y, y′))

Log loss, where the graphical model parameterizes the probability distribution:

∑_{i=1}^{n} [ log ∑_{y∈L^{ℓi+1}} exp ( w · ∑_{j=1}^{ℓi+1} φ(xi, j, yj, yj−1) )   (“fear”)
              − w · ∑_{j=1}^{ℓi+1} φ(xi, j, y^i_j, y^i_{j−1}) ]                  (“hope”)

67 / 77

slide-68
SLIDE 68

Conditional Random Fields (General Version)

Factor graph consisting of “input” variables X (always observed) and “output” variables Y .

p(Y = y | X = x) = ∏_j fj(x, yIj) / ∑_{y′∈Val(Y)} ∏_j fj(x, y′Ij)

MLE:

∑_{i=1}^{n} [ log ∑_{y∈Val(Y)} ∏_j fj(xi, yIj)   (“fear”)   − log ∏_j fj(xi, y^i_Ij)   (“hope”) ]

Marginal inference is required for calculating the left term and its gradient with respect to w.

68 / 77

slide-69
SLIDE 69

MPE Inference

argmax_{u∈Val(U)} p(U = u | O = o)

69 / 77

slide-70
SLIDE 70

MPE Inference

argmax_{u∈Val(U)} p(U = u | O = o)

Variable elimination and exact inference are identical to the marginal case!

70 / 77

slide-71
SLIDE 71

MPE Inference

argmax_{u∈Val(U)} p(U = u | O = o)

Variable elimination and exact inference are identical to the marginal case! Just replace each sum operation with a max operation, and add bookkeeping to recover the most probable assignment.

71 / 77

slide-72
SLIDE 72

MPE Inference

argmax_{u∈Val(U)} p(U = u | O = o)

Variable elimination and exact inference are identical to the marginal case! Just replace each sum operation with a max operation, and add bookkeeping to recover the most probable assignment. The Viterbi algorithm is, of course, an instance of this. Each “si(∗)” is an intermediate factor.

72 / 77
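The sum-to-max swap with bookkeeping, instantiated as Viterbi on the same kind of tiny two-state HMM (all parameters here are made up), with the recovered path checked against brute-force search:

```python
from itertools import product

pi    = [0.6, 0.4]                # p(y1)        (hypothetical numbers)
trans = [[0.7, 0.3], [0.2, 0.8]]  # p(y' | y)
emit  = [[0.9, 0.1], [0.3, 0.7]]  # p(x | y)
x = [0, 1, 1, 0]
n, K = len(x), 2

# Viterbi: the forward recurrence with sum replaced by max, plus
# backpointers to recover the argmax assignment.
delta = [[pi[y] * emit[y][x[0]] for y in range(K)]]
back = []
for i in range(1, n):
    row, ptr = [], []
    for y in range(K):
        best_yp = max(range(K), key=lambda yp: delta[-1][yp] * trans[yp][y])
        ptr.append(best_yp)
        row.append(delta[-1][best_yp] * trans[best_yp][y] * emit[y][x[i]])
    delta.append(row)
    back.append(ptr)

# Follow the backpointers from the best final state.
y_last = max(range(K), key=lambda y: delta[-1][y])
path = [y_last]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
path.reverse()

# Brute-force check: enumerate all K^n state sequences.
def joint(ys):
    p = pi[ys[0]] * emit[ys[0]][x[0]]
    for i in range(1, n):
        p *= trans[ys[i - 1]][ys[i]] * emit[ys[i]][x[i]]
    return p

best = max(product(range(K), repeat=n), key=joint)
print(path == list(best))   # True
```

Only two lines differ from the forward pass: the `sum` becomes a `max`, and the backpointer table is the added bookkeeping.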

slide-73
SLIDE 73

MPE Inference

argmax_{u∈Val(U)} p(U = u | O = o)

Variable elimination and exact inference are identical to the marginal case! Just replace each sum operation with a max operation, and add bookkeeping to recover the most probable assignment. The Viterbi algorithm is, of course, an instance of this. Each “si(∗)” is an intermediate factor. Specifically for sequence models, it should be clear how factors/features that depend on the observed sequence X don’t affect the asymptotics of exact inference.

73 / 77

slide-74
SLIDE 74

Rocket Science: True MAP

Given a factor graph with variables V and factors f, and given O = o (where O ⊂ V ), find the most probable assignment of Q ⊂ U = V \ O. Let R = U \ Q.

argmax_{q∈Val(Q)} p(Q = q | O = o) = argmax_{q∈Val(Q)} ∑_{r∈Val(R)} p(Q = q, R = r | O = o)

Solution: first use marginal inference to eliminate R, then use max inference to solve for Q.

74 / 77

slide-75
SLIDE 75

Alternative Inference Methods

Huge range of techniques! Exact:

◮ Integer linear programming

Inexact:

◮ Randomized (e.g., Gibbs sampling, importance sampling, simulated annealing)
◮ Deterministic (e.g., mean-field variational inference, loopy belief propagation, linear programming relaxations, dual decomposition, beam search)

75 / 77

slide-76
SLIDE 76

Readings and Reminders

◮ Koller et al. (2007)
◮ Submit a suggestion for an exam question by Friday at 5pm.
◮ Your project is due March 9.

76 / 77

slide-77
SLIDE 77

References I

Daphne Koller, Nir Friedman, Lise Getoor, and Ben Taskar. Graphical models in a nutshell, 2007. URL http://www.seas.upenn.edu/~taskar/pubs/gms-srl07.pdf.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, 2001.

77 / 77