Inference: Graph Search (CS 6355: Structured Prediction)



slide-1
SLIDE 1

CS 6355: Structured Prediction

Inference: Graph Search

1

slide-2
SLIDE 2

So far in the class

  • Thinking about structures

– A graph, a collection of parts that are labeled jointly, a collection of decisions

  • Algorithms for learning

– Local learning

  • Learn parameters for individual components independently
  • Learning algorithm not aware of the full structure

– Global learning

  • Learn parameters for the full structure
  • Learning algorithm “knows” about the full structure
  • Next: Prediction

– Sets structured prediction apart from binary/multiclass

2

slide-3
SLIDE 3

Inference

  • What is inference?

– An overview of what we have seen before
– Combinatorial optimization
– Different views of inference

  • Graph algorithms

– Dynamic programming, greedy algorithms, search

  • Integer programming
  • Heuristics for inference

– Sampling

  • Learning to search

3

slide-5
SLIDE 5

Variable elimination: Max-product

We have a collection of inference variables that need to be assigned: $\mathbf{y} = (y_1, y_2, \dots)$

5

slide-7
SLIDE 7

Variable elimination: Max-product

We have a collection of inference variables that need to be assigned: $\mathbf{y} = (y_1, y_2, \dots)$

General algorithm

– First fix an ordering of the variables, say $(y_1, y_2, \dots)$
– Iteratively:

  • Find the best value for yi given the values of the previous neighbors

– Use back pointers to find the final answer

7

Viterbi is an instance of max-product variable elimination

slide-8
SLIDE 8

Variable elimination example

8

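[Figure: chain of output variables y1, y2, y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: emissions(y1) on y1 and transitions(y1, y2) on the edge between y1 and y2]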

slide-9
SLIDE 9

Variable elimination example

9

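[Figure: chain of output variables y1, y2, y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: emissions(y1) and transitions(y1, y2)]

$\text{score-local}(y_i, y_{i+1}) = \text{emissions}(y_{i+1}) + \text{transitions}(y_i, y_{i+1})$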

slide-10
SLIDE 10

Variable elimination example

10

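[Figure: chain of output variables y1, y2, y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: emissions(y1) and transitions(y1, y2)]

First eliminate y1:

$\text{score}_2(y_2) = \max_{y_1} \left[ \text{score}_1(y_1) + \text{score-local}(y_1, y_2) \right]$

$\text{score-local}(y_i, y_{i+1}) = \text{emissions}(y_{i+1}) + \text{transitions}(y_i, y_{i+1})$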

slide-11
SLIDE 11

Variable elimination example

11

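[Figure: remaining chain y2, y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: $\text{score}_2(y_2)$ and transitions(y2, y3)]

$\text{score-local}(y_i, y_{i+1}) = \text{emissions}(y_{i+1}) + \text{transitions}(y_i, y_{i+1})$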

slide-12
SLIDE 12

Variable elimination example

12

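[Figure: remaining chain y2, y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: $\text{score}_2(y_2)$ and transitions(y2, y3)]

Next eliminate y2:

$\text{score}_3(y_3) = \max_{y_2} \left[ \text{score}_2(y_2) + \text{score-local}(y_2, y_3) \right]$

$\text{score-local}(y_i, y_{i+1}) = \text{emissions}(y_{i+1}) + \text{transitions}(y_i, y_{i+1})$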

slide-13
SLIDE 13

Variable elimination example

13

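[Figure: remaining chain y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: $\text{score}_3(y_3)$ and transitions(y3, y4)]

$\text{score-local}(y_i, y_{i+1}) = \text{emissions}(y_{i+1}) + \text{transitions}(y_i, y_{i+1})$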

slide-14
SLIDE 14

Variable elimination example

14

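[Figure: remaining chain y3, …, yn, each taking a label from {A, B, C, D}. Factors shown: $\text{score}_3(y_3)$ and transitions(y3, y4)]

Next eliminate y3:

$\text{score}_4(y_4) = \max_{y_3} \left[ \text{score}_3(y_3) + \text{score-local}(y_3, y_4) \right]$

$\text{score-local}(y_i, y_{i+1}) = \text{emissions}(y_{i+1}) + \text{transitions}(y_i, y_{i+1})$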

slide-15
SLIDE 15

Variable elimination example

15

[Figure: only yn remains, taking a label from {A, B, C, D}, with the factor $\text{score}_n(y_n)$]

After n such steps, we have all the information to make a decision for yn.
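The elimination steps in this example can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function names `viterbi` and `brute_force`, the dictionary layout of `emissions` and `transitions`, and all toy numbers are assumptions made for the example. It eliminates one variable at a time, stores back pointers, and reads off the final assignment.

```python
from itertools import product

def viterbi(emissions, transitions, labels):
    """Max-product variable elimination on a chain.

    emissions: list of dicts, emissions[i][label] -> score at position i
    transitions: dict, transitions[(prev, cur)] -> score
    Returns (best_score, best_label_sequence).
    """
    # score[label] = best score of any prefix that ends in `label`
    score = {y: emissions[0][y] for y in labels}
    back_pointers = []
    for i in range(1, len(emissions)):
        new_score, pointers = {}, {}
        for cur in labels:
            # Eliminate the previous variable by maximizing over its values
            best_prev = max(labels, key=lambda prev: score[prev] + transitions[(prev, cur)])
            new_score[cur] = score[best_prev] + transitions[(best_prev, cur)] + emissions[i][cur]
            pointers[cur] = best_prev
        score = new_score
        back_pointers.append(pointers)
    # Decide the last variable, then follow the back pointers
    last = max(labels, key=lambda y: score[y])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

def brute_force(emissions, transitions, labels):
    """Enumerate every assignment; only feasible for tiny problems."""
    def total(seq):
        s = sum(emissions[i][y] for i, y in enumerate(seq))
        return s + sum(transitions[(a, b)] for a, b in zip(seq, seq[1:]))
    best = max(product(labels, repeat=len(emissions)), key=total)
    return total(best), list(best)

# Toy scores (hypothetical, for illustration only)
labels = ["A", "B"]
emissions = [{"A": 1.0, "B": 0.5}, {"A": 0.2, "B": 1.5}, {"A": 0.7, "B": 0.1}]
transitions = {("A", "A"): 0.5, ("A", "B"): -1.0, ("B", "A"): 0.3, ("B", "B"): 0.8}
best_score, best_path = viterbi(emissions, transitions, labels)
```

On this toy input the chain elimination and the exhaustive enumeration agree on the best sequence, while the elimination only does work proportional to n times the square of the number of labels.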

slide-17
SLIDE 17

Variable elimination: Max-product

We have a collection of inference variables that need to be assigned: $\mathbf{y} = (y_1, y_2, \dots)$

General algorithm

– First fix an ordering of the variables, say $(y_1, y_2, \dots)$
– Iteratively:

  • Find the best value for yi given the values of the previous neighbors

– Use back pointers to find the final answer

17

Viterbi is an instance of max-product variable elimination

Challenge: What makes a good order?

slide-18
SLIDE 18

Max-product algorithm

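  • Where is the “product” in max-product?

$\mathbf{w}^T \phi(\mathbf{x}, \mathbf{y}) = \sum_i \text{score-local}(y_i, y_{i+1})$

The local scores add up because they live in log space; exponentiating the total turns the sum into a product of local factors.

  • Generalizes beyond sequence models

– Requires a clever ordering of the output variables
– Exact inference when the output is a tree

  • If not, no guarantees
  • Also works for summing over all structures

– Sum-product message passing
– Belief propagation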
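To illustrate how the same elimination recursion covers both maximizing and summing over structures, here is a hedged Python sketch for a chain model. The helper names (`chain_eliminate`, `logsumexp`, `total_score`) and the toy scores are assumptions for this example only. Passing `max` as the combiner gives the best score; passing a log-sum-exp gives the log of the sum over all structures.

```python
import math
from itertools import product

LABELS = ("A", "B")
# Toy log-space scores (hypothetical)
EMISSIONS = ({"A": 1.0, "B": 0.5}, {"A": 0.2, "B": 1.5}, {"A": 0.7, "B": 0.1})
TRANSITIONS = {("A", "A"): 0.5, ("A", "B"): -1.0, ("B", "A"): 0.3, ("B", "B"): 0.8}

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def chain_eliminate(combine):
    """Eliminate chain variables left to right.

    combine=max       -> max-product (score of the best structure)
    combine=logsumexp -> sum-product (log of the sum over all structures)
    """
    score = {y: EMISSIONS[0][y] for y in LABELS}
    for i in range(1, len(EMISSIONS)):
        score = {
            cur: EMISSIONS[i][cur]
            + combine([score[prev] + TRANSITIONS[(prev, cur)] for prev in LABELS])
            for cur in LABELS
        }
    return combine(list(score.values()))

def total_score(seq):
    """Score of one complete labeling, used only to check the recursion."""
    s = sum(EMISSIONS[i][y] for i, y in enumerate(seq))
    return s + sum(TRANSITIONS[(a, b)] for a, b in zip(seq, seq[1:]))

all_scores = [total_score(seq) for seq in product(LABELS, repeat=len(EMISSIONS))]
best = chain_eliminate(max)
log_z = chain_eliminate(logsumexp)
```

Both quantities match what exhaustive enumeration over `all_scores` produces, because max and sum each distribute over the addition of local scores.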

18

slide-20
SLIDE 20

Dynamic programming

  • General solution strategy for inference
  • Examples

– Viterbi, CKY algorithm, Dijkstra’s algorithm, and many more

  • Key ideas:

– Memoization: Don’t re-compute something you already have
– Requires an ordering of the variables

  • Remember:

– The hypergraph may not allow for the best ordering of the variables
– Existence of a dynamic programming algorithm does not mean polynomial time/space

  • State space may be too big. Use heuristics such as beam search
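The memoization idea can be sketched with Python's `functools.lru_cache`. This is only an illustration under assumed toy scores; the function name `best_prefix_score` and the constants are hypothetical. The recursion follows the ordering of the variables, and the cache ensures each (position, label) subproblem is solved once.

```python
from functools import lru_cache

LABELS = ("A", "B")
# Toy scores (hypothetical)
EMISSIONS = ({"A": 1.0, "B": 0.5}, {"A": 0.25, "B": 1.5}, {"A": 0.75, "B": 0.0})
TRANSITIONS = {("A", "A"): 0.5, ("A", "B"): -1.0, ("B", "A"): 0.25, ("B", "B"): 1.0}

@lru_cache(maxsize=None)
def best_prefix_score(i, label):
    """Best score of any labeling of positions 0..i that ends in `label`."""
    if i == 0:
        return EMISSIONS[0][label]
    return EMISSIONS[i][label] + max(
        best_prefix_score(i - 1, prev) + TRANSITIONS[(prev, label)] for prev in LABELS
    )

best = max(best_prefix_score(len(EMISSIONS) - 1, y) for y in LABELS)
# Number of subproblems actually computed: one per (position, label) pair
subproblems_solved = best_prefix_score.cache_info().misses
```

Without the cache, the same recursion would revisit each subproblem once per path leading to it, which grows exponentially with the chain length.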

20

slide-21
SLIDE 21

Graph algorithms for inference

  • Many graph algorithms you have seen are applicable for inference
  • Some examples

– “Best” path. E.g., Viterbi, parsing
– Min-cut/max-flow. E.g., image segmentation
– Maximum spanning tree. E.g., dependency parsing
– Bipartite matching. E.g., aligning sequences

21

slide-22
SLIDE 22

Best path for inference

  • Broad description of approach:

– Construct a graph/hypergraph from the input and output
– Decompose the total score along edges/hyperedges
– Inference is finding the shortest/longest path in this weighted graph

The Viterbi algorithm finds a shortest path in a specific graph!

22

slide-23
SLIDE 23

Viterbi algorithm as best path

23

Goal: To find the highest scoring path in this trellis

[Figure: trellis with time steps along one axis and the different labels for each step along the other]

slide-24
SLIDE 24

Viterbi algorithm as best path

24

Goal: To find the highest scoring path in this trellis

[Figure: trellis with the different labels for each step]

slide-25
SLIDE 25

Viterbi algorithm as best path

25

Goal: To find the highest scoring path in this trellis

  • No cycles
  • Nodes and edges have a specific meaning
  • Ordering helps

[Figure: trellis with the different labels for each step]

slide-26
SLIDE 26

Best path algorithms

  • Dijkstra’s algorithm

– Cost functions should be non-negative

  • Bellman-Ford algorithm

– Slower than Dijkstra’s algorithm but works with negative weights

  • A* search

– Applicable if you have a heuristic that estimates the future path cost from a state and never over-estimates it
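As a rough sketch of the A* idea, the Python code below searches a small weighted graph with a priority queue. The graph, the heuristic values, and the function name `a_star` are all hypothetical. It assumes non-negative edge costs and a heuristic that never over-estimates the remaining cost; with a heuristic that always returns 0 it behaves like Dijkstra's algorithm. To use it for score maximization, scores would first have to be turned into non-negative costs, which is an extra modeling step not shown here.

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """Return (cost, path) of a cheapest path, or (inf, []) if none exists.

    graph: dict mapping node -> list of (neighbor, non-negative edge cost)
    heuristic: function node -> estimate of the remaining cost to the goal
    """
    # Queue entries: (cost so far + heuristic, cost so far, node, path)
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                priority = new_cost + heuristic(neighbor)
                heapq.heappush(frontier, (priority, new_cost, neighbor, path + [neighbor]))
    return float("inf"), []

# Toy graph (hypothetical): cheapest route is s -> a -> b -> t with cost 4
graph = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("t", 6.0)],
    "b": [("t", 1.0)],
    "t": [],
}
estimates = {"s": 3.0, "a": 2.0, "b": 1.0, "t": 0.0}  # never above the true remaining cost

astar_cost, astar_path = a_star(graph, "s", "t", lambda n: estimates[n])
dijkstra_cost, dijkstra_path = a_star(graph, "s", "t", lambda n: 0.0)
```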

26

slide-27
SLIDE 27

Inference as search: Setting

  • Predicting a graph as a sequence of decisions
  • Data structures:

– State: Encodes partial structure
– Transitions: Move from one partial structure to another
– Start state
– End state: We have a full structure

  • There may be more than one end state
  • Each transition is scored with the learned model
  • Goal: Find an end state that has the highest total score
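The data structures listed above could be represented as follows. This is a minimal sketch under assumptions: a state is a tuple with `None` marking an unassigned slot, and the label set `("A", "B", "C")` and all function names are chosen for illustration. Scoring of transitions by the learned model is left out.

```python
from typing import Iterator, Optional, Tuple

State = Tuple[Optional[str], ...]   # None marks a slot that is still unknown
LABELS = ("A", "B", "C")            # assumed label set for illustration

def start_state(n: int) -> State:
    """Start state: no decisions have been made."""
    return (None,) * n

def is_end_state(state: State) -> bool:
    """End state: every slot holds a label, so the structure is complete."""
    return all(label is not None for label in state)

def transitions(state: State) -> Iterator[State]:
    """Yield every state reachable by filling in exactly one unknown slot."""
    for i, label in enumerate(state):
        if label is None:
            for new_label in LABELS:
                yield state[:i] + (new_label,) + state[i + 1:]

successors = list(transitions(start_state(3)))
```

From the empty three-slot state there are nine successors (three slots times three labels); fixing an order in which slots are filled would reduce this to three.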

27

slide-28
SLIDE 28

Example

28

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned
slide-29
SLIDE 29

Example

29

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-)

Start state: No assignments

slide-30
SLIDE 30

Example

30

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-) → (A,-,-), (B,-,-), (C,-,-)

Fill in a label in a slot. The edge is scored by the factors that can be computed so far.

slide-31
SLIDE 31

Example

31

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-) → (A,-,-), (B,-,-), (C,-,-) → (A,A,-), …, (C,C,-)

Keep assigning values to slots

slide-32
SLIDE 32

Example

32

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-) → (A,-,-), (B,-,-), (C,-,-) → (A,A,-), …, (C,C,-) → (A,A,A), …, (C,C,C)

Until we reach a goal state

slide-33
SLIDE 33

Example

33

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-) → (A,-,-), (B,-,-), (C,-,-) → (A,A,-), …, (C,C,-) → (A,A,A), …, (C,C,C)

Note: Here we have assumed an ordering (y1, y2, y3)
slide-34
SLIDE 34

Example

34

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-) → (A,-,-), (B,-,-), (C,-,-) → (A,A,-), …, (C,C,-) → (A,A,A), …, (C,C,C)

Note: Here we have assumed an ordering (y1, y2, y3)

How do the transitions get scored?

slide-35
SLIDE 35

Example

35

[Figure: inputs x1, x2, x3 and outputs y1, y2, y3]

Suppose each y can be one of A, B or C

  • State: Triples (y1, y2, y3), all possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

Search graph: (-,-,-) → (A,-,-), (B,-,-), (C,-,-) → (A,A,-), …, (C,C,-) → (A,A,A), …, (C,C,C)

The goal of inference: To traverse this graph from the start state and reach the end state that has the best (highest/lowest) score

slide-36
SLIDE 36

Graph search algorithms

  • Standard graph search algorithms can be used for inference
  • Breadth/depth first search

– Keep a stack/queue/priority queue of “open” states

  • That is, states that are to be explored

– The good: Guaranteed to be correct

  • Explores every option

– The bad?

  • Explores every option: Memory is an issue
  • Could be slow for any non-trivial graph
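A hedged sketch of exhaustive search over the state graph is shown below. It assumes a fixed slot ordering, so a state is just the tuple of labels chosen so far, and it uses a made-up `transition_score` in place of a learned model. The same loop gives depth-first search with a stack and breadth-first search with a queue.

```python
from collections import deque

LABELS = ("A", "B", "C")

def transition_score(state, label):
    """Hypothetical stand-in for the learned model's score of one transition."""
    score = {"A": 1.0, "B": 0.5, "C": 0.0}[label]
    if state and state[-1] == label:
        score -= 2.0  # toy penalty for repeating the previous label
    return score

def exhaustive_search(n, breadth_first=False):
    """Explore every state; return (best_end_state, best_score, states_explored)."""
    open_states = deque([((), 0.0)])  # start state: no slots filled, score 0
    best_state, best_score, explored = None, float("-inf"), 0
    while open_states:
        # FIFO order gives breadth-first search, LIFO order gives depth-first
        state, score = open_states.popleft() if breadth_first else open_states.pop()
        explored += 1
        if len(state) == n:  # end state: all slots assigned
            if score > best_score:
                best_state, best_score = state, score
            continue
        for label in LABELS:
            open_states.append((state + (label,), score + transition_score(state, label)))
    return best_state, best_score, explored

result_dfs = exhaustive_search(3)
result_bfs = exhaustive_search(3, breadth_first=True)
```

Both orders visit all 1 + 3 + 9 + 27 = 40 states for three slots and three labels, which is why this approach stops being practical as the structure grows.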

36

slide-37
SLIDE 37

Greedy search

  • At each state, choose the highest scoring next transition

– Keep only one state in memory: The current state

  • What is the problem?

– Locally best decisions may miss the global optimum
– Does not explore the full search space

  • Greedy algorithms can give the true optimum for special classes of problems

– E.g., maximum spanning tree algorithms are greedy
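The following Python sketch shows greedy search and a case where it misses the optimum. The score tables `FIRST` and `NEXT` are invented for this illustration (they are not from the slides), and a fixed slot ordering is assumed. Greedy commits to the label with the best immediate score, while an exhaustive check finds a better complete assignment.

```python
from itertools import product

LABELS = ("A", "B", "C")
# Hypothetical transition scores: FIRST for the first slot, NEXT[prev][cur] afterwards
FIRST = {"A": 2.0, "B": 1.0, "C": 0.0}
NEXT = {
    "A": {"A": 1.0, "B": -1.0, "C": 0.0},
    "B": {"A": 0.0, "B": 0.0, "C": 5.0},
    "C": {"A": 1.0, "B": 0.0, "C": 0.0},
}

def transition_score(state, label):
    """Score of extending the partial assignment `state` with `label`."""
    return FIRST[label] if not state else NEXT[state[-1]][label]

def greedy_search(n):
    """Keep one state in memory and always take the best-scoring transition."""
    state, total = (), 0.0
    for _ in range(n):
        label = max(LABELS, key=lambda l: transition_score(state, l))
        total += transition_score(state, label)
        state += (label,)
    return state, total

def total_score(seq):
    """Total score of a complete assignment, summed over its transitions."""
    return sum(transition_score(seq[:i], label) for i, label in enumerate(seq))

greedy_state, greedy_total = greedy_search(3)
optimal_state = max(product(LABELS, repeat=3), key=total_score)
optimal_total = total_score(optimal_state)
```

Here greedy picks A first because it has the highest immediate score, and never discovers that starting with B unlocks the large B to C transition.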

37

Questions?

slide-38
SLIDE 38

Beam search: A compromise

  • Keep size-limited priority queue of states

– Called the beam, sorted by total score for the state

  • At each step:

– Explore all transitions from the current state
– Add all to beam and trim the size

38

slide-39
SLIDE 39

Beam search: A compromise

39

Example: Suppose we have a beam of size k = 2

slide-40
SLIDE 40

Beam search: A compromise

40

Example: Suppose we have a beam of size k = 2

Beam: (−, −, −)

At the beginning, the beam has only one element, the start state
slide-41
SLIDE 41

Beam search: A compromise

41

Example: Suppose we have a beam of size k = 2

Beam: (−, −, −)

Expand all the states in the beam:

(−, −, −) → (A, −, −), (B, −, −), (C, −, −)

slide-42
SLIDE 42

Beam search: A compromise

42

Example: Suppose we have a beam of size k = 2

Beam: (−, −, −)

Expand all the states in the beam
Score the newly created states

(A, −, −): 0.9
(B, −, −): 10
(C, −, −): −3
slide-44
SLIDE 44

Beam search: A compromise

44

Example: Suppose we have a beam of size k = 2

Beam: (−, −, −)

Expand all the states in the beam
Score the newly created states
The top k new states form the new beam (sorted)

(A, −, −): 0.9
(B, −, −): 10
(C, −, −): −3
slide-45
SLIDE 45

Beam search: A compromise

45

Example: Suppose we have a beam of size k = 2

Expand all the states in the beam
Score the newly created states
The top k new states form the new beam (sorted)

Previous beam: (−, −, −)
New beam: (B, −, −), (A, −, −)

slide-46
SLIDE 46

Beam search: A compromise

46

Example: Suppose we have a beam of size k = 2

Expand all the states in the beam
Score the newly created states
The top k new states form the new beam (sorted)

Previous beam: (−, −, −)
New beam: (B, −, −), (A, −, −)

Now we are ready for the next step

slide-47
SLIDE 47

Beam search: A compromise

47

Example: Suppose we have a beam of size k = 2

Beam: (B, −, −), (A, −, −)

Expand all the states in the beam:

(B, −, −) → (B, A, −), (B, B, −), (B, C, −)
(A, −, −) → (A, A, −), (A, B, −), (A, C, −)

slide-48
SLIDE 48

Beam search: A compromise

48

Example: Suppose we have a beam of size k = 2

Beam: (B, −, −), (A, −, −)

Expand all the states in the beam
Score the newly created states

(B, A, −): 0.1
(B, B, −): −3
(B, C, −): 10
(A, A, −): 20
(A, B, −): −1
(A, C, −): 4.1

slide-49
SLIDE 49

Beam search: A compromise

49

Example: Suppose we have a beam of size k = 2

Beam: (B, −, −), (A, −, −)

Expand all the states in the beam
Score the newly created states
The top k new states form the new beam (sorted)

(B, A, −): 0.1
(B, B, −): −3
(B, C, −): 10
(A, A, −): 20
(A, B, −): −1
(A, C, −): 4.1

slide-50
SLIDE 50

Beam search: A compromise

50

Example: Suppose we have a beam of size k = 2

Expand all the states in the beam
Score the newly created states
The top k new states form the new beam (sorted)

Previous beam: (B, −, −), (A, −, −)
New beam: (A, A, −), (B, C, −)

slide-51
SLIDE 51

Beam search: A compromise

51

Example: Suppose we have a beam of size k = 2

Start: (−, −, −)
Beam after step 1: (B, −, −), (A, −, −)
Beam after step 2: (A, A, −), (B, C, −)
Beam after step 3: (A, A, B), (B, C, C)

slide-52
SLIDE 52

Beam search: A compromise

52

Example: Suppose we have a beam of size k = 2

Start: (−, −, −)
Beam after step 1: (B, −, −), (A, −, −)
Beam after step 2: (A, A, −), (B, C, −)
Beam after step 3: (A, A, B), (B, C, C)

Final answer: Top of the beam at the end of search, here (A, A, B)

slide-53
SLIDE 53

Beam search: A compromise

  • Keep size-limited priority queue of states

– Called the beam, sorted by total score for the state

  • At each step:

– Explore all transitions from the current state
– Add all to beam and trim the size

  • The good: Explores more than greedy search

– Greedy search is beam search with beam size 1

  • The bad: A good state might fall out of the beam
  • In general, easy to implement, very popular

– No guarantees
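Beam search as described above can be sketched in a few lines of Python. The score tables are invented for illustration and a fixed slot ordering is assumed; `heapq.nlargest` does the "add all and trim" step. With `k = 1` the function reduces to greedy search, and on this toy problem a beam of size 2 is already enough to recover the best assignment, though in general there is no such guarantee.

```python
import heapq

LABELS = ("A", "B", "C")
# Hypothetical transition scores: FIRST for the first slot, NEXT[prev][cur] afterwards
FIRST = {"A": 2.0, "B": 1.0, "C": 0.0}
NEXT = {
    "A": {"A": 1.0, "B": -1.0, "C": 0.0},
    "B": {"A": 0.0, "B": 0.0, "C": 5.0},
    "C": {"A": 1.0, "B": 0.0, "C": 0.0},
}

def transition_score(state, label):
    """Score of extending the partial assignment `state` with `label`."""
    return FIRST[label] if not state else NEXT[state[-1]][label]

def beam_search(n, k):
    """Return (state, total_score) at the top of the beam after n steps."""
    beam = [((), 0.0)]  # the beam starts with only the start state
    for _ in range(n):
        # Expand every state in the beam and score the newly created states
        candidates = [
            (state + (label,), total + transition_score(state, label))
            for state, total in beam
            for label in LABELS
        ]
        # The top k new states form the new beam, sorted best first
        beam = heapq.nlargest(k, candidates, key=lambda item: item[1])
    return beam[0]

greedy_result = beam_search(3, k=1)
beam_result = beam_search(3, k=2)
```

With `k = 1` the search commits to A at the first step; with `k = 2` the state starting with B survives the first trim and later overtakes it.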

53

slide-55
SLIDE 55

Summary: Inference as graph search

  • MAP inference with discrete random variables involves finding a score-maximizing assignment to variables
  • We can incrementally construct such an assignment using graph algorithms

– Many inference algorithms are efficient dynamic programming formulations
– General graph search is also helpful

  • Popular heuristics in this family of methods:

– Greedy search
– Beam search

55