SLIDE 1

Advanced Dynamic Programming in CL:
Theory, Algorithms, and Applications

Liang Huang
University of Pennsylvania

[Figure: goal item (S, 0, n) spanning the input w0 w1 ... wn-1]


SLIDE 5

Liang Huang (Penn) Dynamic Programming

A Little Bit of History...

  • Who invented Dynamic Programming? And when was it invented?
  • R. Bellman (1940s-50s)
  • A. Viterbi (1967)
  • E. Dijkstra (1959)
  • Hart, Nilsson, and Raphael (1968): Dijkstra => A* Algorithm
  • D. Knuth (1977): Dijkstra on Grammars (Hypergraphs)
  • A. Turing

[Photos: Andrew Viterbi, Richard Bellman]
SLIDE 6

Dynamic Programming

  • Dynamic Programming is everywhere in NLP
  • Viterbi Algorithm for Hidden Markov Models
  • CKY Algorithm for Parsing and Machine Translation
  • Forward-Backward and Inside-Outside Algorithms
  • Also everywhere in AI/ML
  • Reinforcement Learning, Planning (POMDP)
  • AI Search: Uniform-cost, A*, etc.
  • This tutorial: a unified theoretical view of DP
  • Focusing on Optimization Problems

SLIDE 7

Review: DP Basics

  • DP = Divide-and-Conquer + Two Principles:
  • [required] Optimal Substructure Property
  • [recommended] Sharing of Common Subproblems
  • Structure of the Search Space
  • Incremental => Graph: Knapsack, Edit Distance, Sequence Alignment
  • Branching => Hypergraph: Matrix-Chain, Polygon Triangulation, Optimal BST

SLIDE 8

Two Dimensional Survey

Two dimensions: traversing order (rows) × search space (columns)

                        graphs with semirings (e.g., FSMs)   hypergraphs with weight functions (e.g., CFGs)
topological (acyclic)   Viterbi                              Generalized Viterbi
best-first (superior)   Dijkstra                             Knuth

SLIDE 9

Graphs in NLP

[Figures: a part-of-speech tagging graph; a lattice in speech]

SLIDE 10

Semirings on Graphs

  • in a weighted graph, we need two operators:
  • extension (multiplicative ⊗) and summary (additive ⊕)
  • the weight of a path is the product of its edge weights:
    d(π1) = ⊗_{ei ∈ π1} w(ei) = w(e1) ⊗ w(e2) ⊗ w(e3)
  • the weight of a vertex is the summary of its path weights:
    d(t) = ⊕_{πi} w(πi) = w(π1) ⊕ w(π2) ⊕ · · ·

[Figure: a path s → u → ... → v → t with edges e1, e2, e3]

SLIDE 15

Semiring Definitions

A monoid is a triple (A, ⊗, 1) where

  • 1. ⊗ is a closed associative binary operator on the set A,
  • 2. 1 is the identity element for ⊗, i.e., for all a ∈ A, a ⊗ 1 = 1 ⊗ a = a.

A monoid is commutative if ⊗ is commutative.

A semiring is a 5-tuple R = (A, ⊕, ⊗, 0, 1) such that

  • 1. (A, ⊕, 0) is a commutative monoid.
  • 2. (A, ⊗, 1) is a monoid.
  • 3. ⊗ distributes over ⊕: for all a, b, c in A,
    (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c), c ⊗ (a ⊕ b) = (c ⊗ a) ⊕ (c ⊗ b).
  • 4. 0 is an annihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0 = 0.

Example monoids: ([0, 1], +, 0), ([0, 1], ×, 1), ([0, 1], max, 0)
Example semirings: ([0, 1], max, ×, 0, 1), ([0, 1], +, ×, 0, 1)
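The definitions above can be spot-checked mechanically. A minimal sketch in Python; the `Semiring` tuple and `check_semiring` helper are illustrative names, not from any library:

```python
# Illustrative sketch: semirings as (plus, times, zero, one) tuples.
from collections import namedtuple

Semiring = namedtuple("Semiring", "plus times zero one")

# ([0, 1], max, ×, 0, 1): the Viterbi semiring
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
# (R+ ∪ {+∞}, min, +, +∞, 0): the tropical semiring
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)

def check_semiring(K, samples):
    """Spot-check the semiring axioms on a few sample values."""
    for a in samples:
        assert K.plus(a, K.zero) == a                       # 0 is the identity for ⊕
        assert K.times(a, K.one) == a == K.times(K.one, a)  # 1 is the identity for ⊗
        assert K.times(a, K.zero) == K.zero == K.times(K.zero, a)  # 0 annihilates
        for b in samples:
            for c in samples:
                lhs = K.times(K.plus(a, b), c)              # (a ⊕ b) ⊗ c
                rhs = K.plus(K.times(a, c), K.times(b, c))  # (a ⊗ c) ⊕ (b ⊗ c)
                assert lhs == rhs                           # distributivity
    return True

ok = (check_semiring(VITERBI, [0.0, 0.25, 0.5, 1.0])
      and check_semiring(TROPICAL, [0.0, 1.5, 3.0, float("inf")]))
```

The same checker also rejects non-examples, e.g. a set that is not closed under the operator.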

SLIDE 16

Examples

Semiring   Set          ⊕    ⊗   0    1   intuition/application
Boolean    {0, 1}       ∨    ∧   0    1   logical deduction, recognition
Viterbi    [0, 1]       max  ×   0    1   prob. of the best derivation
Inside     R+ ∪ {+∞}    +    ×   0    1   prob. of a string
Real       R ∪ {+∞}     min  +   +∞   0   shortest-distance
Tropical   R+ ∪ {+∞}    min  +   +∞   0   shortest-distance with non-negative weights
Counting   N            +    ×   0    1   number of paths

SLIDE 20

Ordering

  • idempotent
  • comparison
  • examples: boolean, viterbi, tropical, real, ...
  • total-order for optimization problems
  • examples: all of the above

A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.

(a ≤ b) ⇔ (a ⊕ b = a) defines a partial ordering.

A semiring is totally-ordered if ⊕ defines a total ordering.

Examples: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R+ ∪ {+∞}, min, +, +∞, 0), (R ∪ {+∞}, min, +, +∞, 0)

SLIDE 29

Monotonicity

  • monotonicity
  • optimal substructure in dynamic programming
  • idempotent => monotone (from distributivity):
  • (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c); if a ≤ b, then a ⊗ c = (a ⊗ c) ⊕ (b ⊗ c)
  • by def. of comparison, a ⊗ c ≤ b ⊗ c

Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.
We say K is monotonic if for all a, b, c ∈ A:
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c)
(a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)

[Figure: items B: b and C: c combine into A: b ⊗ c; replacing b with b' ≤ b yields A: b' ⊗ c ≤ b ⊗ c]
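The idempotent-implies-monotone argument can be spot-checked numerically, e.g. in the tropical semiring (min, +), with ≤ defined by the comparison a ⊕ b = a:

```python
# Spot-check of idempotence => monotonicity in the tropical semiring.
import itertools

plus = min
times = lambda a, b: a + b
leq = lambda a, b: plus(a, b) == a     # comparison ordering: a ≤ b iff a ⊕ b = a

samples = [0.0, 1.0, 2.5, 7.0]
for a, b, c in itertools.product(samples, repeat=3):
    assert plus(a, a) == a                        # idempotent
    if leq(a, b):                                 # a ≤ b ...
        assert leq(times(a, c), times(b, c))      # ... implies a ⊗ c ≤ b ⊗ c
```

This is a numeric sanity check on sample values, not a proof; the proof is the distributivity argument on the slide.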

SLIDE 30

DP on Graphs

  • optimization problems on graphs => generic shortest-path problem
  • weighted directed graph G = (V, E) with a function w that assigns each edge a weight from a semiring
  • compute the best weight of the target vertex t
  • generic update along edge (u, v):
    d(v) ⊕= d(u) ⊗ w(u, v), i.e., d(v) ← d(v) ⊕ (d(u) ⊗ w(u, v))
  • how to avoid cyclic updates?
  • only update when d(u) is fixed


SLIDE 32

Viterbi Algorithm for DAGs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each incoming edge (u, v) in E, use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
  • key observation: d(u) is fixed to optimal at this time
  • time complexity: O(V + E)
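The two steps above can be sketched as follows, here instantiated in the tropical (min, +) semiring; `viterbi_dag` and the example graph are illustrative:

```python
# Illustrative generic Viterbi for DAGs in the tropical (min, +) semiring.
def viterbi_dag(topo_order, in_edges, source):
    """in_edges[v]: list of incoming (u, w) pairs; topo_order is assumed
    to be a valid topological order of the vertices."""
    d = {v: float("inf") for v in topo_order}   # 0 of the semiring
    d[source] = 0.0                             # 1 of the semiring
    for v in topo_order:                        # d(u) is already fixed here
        for u, w in in_edges.get(v, ()):
            d[v] = min(d[v], d[u] + w)          # d(v) ⊕= d(u) ⊗ w(u, v)
    return d

# tiny example DAG: s -> a -> b -> t, plus shortcuts s -> b and a -> t
order = ["s", "a", "b", "t"]
in_edges = {"a": [("s", 1.0)], "b": [("s", 4.0), ("a", 2.0)],
            "t": [("b", 1.0), ("a", 5.0)]}
d = viterbi_dag(order, in_edges, "s")
```

Swapping min/+ for another monotonic semiring's ⊕/⊗ changes what is computed without changing the traversal.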

SLIDE 33

Variant 1: forward-update

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each outgoing edge (v, u) in E, use d(v) to update d(u): d(u) ⊕= d(v) ⊗ w(v, u)
  • key observation: d(v) is fixed to optimal at this time
  • time complexity: O(V + E)

SLIDE 40

Examples

  • [Number of Paths in a DAG]
  • just use the counting semiring (N, +, ×, 0, 1)
  • note: this is not an optimization problem!
  • [Longest Path in a DAG]
  • just use the semiring (R ∪ {−∞}, max, +, −∞, 0)
  • [Part-of-Speech Tagging with a Hidden Markov Model]
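Both examples run on the same generic update with different semirings plugged in; a small sketch (`run_semiring` and the diamond graph are illustrative):

```python
# Same generic update, two semirings: counting (N, +, ×, 0, 1) and
# longest path (R ∪ {−∞}, max, +, −∞, 0).
def run_semiring(topo_order, in_edges, source, plus, times, zero, one):
    d = {v: zero for v in topo_order}
    d[source] = one
    for v in topo_order:
        for u, w in in_edges.get(v, ()):
            d[v] = plus(d[v], times(d[u], w))   # d(v) ⊕= d(u) ⊗ w(u, v)
    return d

order = ["s", "a", "b", "t"]                    # a diamond DAG
edges = {"a": [("s", 1.0)], "b": [("s", 3.0)], "t": [("a", 2.0), ("b", 1.0)]}
ones = {v: [(u, 1) for u, _ in es] for v, es in edges.items()}  # unit weights

# number of s-to-t paths: 2 (s-a-t and s-b-t)
npaths = run_semiring(order, ones, "s", lambda a, b: a + b,
                      lambda a, b: a * b, 0, 1)
# longest s-to-t path: max(1+2, 3+1)
longest = run_semiring(order, edges, "s", max, lambda a, b: a + b,
                       float("-inf"), 0.0)
```

Note that path counting uses weight 1 on every edge, since the counting semiring multiplies edge weights along a path.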

SLIDE 41

Example: Speech Alignment

time complexity: O(n^2); also used in: edit distance, biological sequence alignment
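The O(n^2) recurrence in its edit-distance form looks like this (a standard sketch, not code from the tutorial); each cell (i, j) is updated from three incoming "edges":

```python
# Standard O(n^2) edit-distance DP: substitution, deletion, insertion.
def edit_distance(x, y):
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j                  # boundary: all inserts/deletes
            else:
                sub = d[i - 1][j - 1] + (x[i - 1] != y[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]
```

Replacing the unit costs with acoustic or substitution scores gives the alignment variants the slide mentions.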

SLIDE 42

Example: Word Alignment

  • key difference: reorderings in translation!
  • sequence/speech alignment is always monotonic
  • complexity under an HMM: word alignment is O(n^3)
  • for every (i, j), enumerate all (i-1, k)
  • sequence alignment is O(n^2)

[Figure: aligning "I love you ." with "Je t'aime ."; state (i, j) is reached from states (i-1, k)]

SLIDE 48

Chinese Word Segmentation

下 雨 天 地 面 积 水
xia yu tian di mian ji shui

  • 民主 (min-zhu, character gloss "people-dominate") => "democracy"
  • 江泽民 主席 (jiang-ze-min zhu-xi, character gloss "... - ... - people dominate-podium") => "President Jiang Zemin"
  • this was 5 years ago; now Google is good at segmentation!
  • segmentation as graph search

SLIDE 55

Huang and Chiang, Forest Rescoring

Phrase-based Decoding

与 沙龙 举行 了 会谈
yu Shalong juxing le huitan
"held a talk with Sharon"

[Figure: decoding hypotheses such as "held a talk", each with a coverage vector over the source words]

source-side: coverage vector; target-side: grow hypotheses strictly left-to-right

space: O(2^n), time: O(2^n n^2) -- cf. traveling salesman problem

SLIDE 56

Traveling Salesman Problem & MT

  • a classical NP-hard problem
  • goal: visit each city once and only once
  • exponential-time dynamic programming (Held and Karp, 1962; Knight, 1999)
  • state: cities visited so far (bit-vector)
  • search in this O(2^n) transformed graph
  • MT: each city is a source-language word
  • restrictions in reordering can reduce complexity => distortion limit => syntax-based MT
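The bit-vector DP above is the Held-Karp recursion; a small sketch, assuming an open path that starts at city 0 (a stand-in for left-to-right decoding; `held_karp` is an illustrative name):

```python
# Held-Karp sketch: DP state = (bit-vector of visited cities, last city),
# giving O(2^n n^2) time instead of O(n!) enumeration.
from itertools import combinations

def held_karp(dist):
    """dist[i][j]: cost of going from city i to city j; returns the
    cheapest path visiting every city exactly once, starting at 0."""
    n = len(dist)
    INF = float("inf")
    d = {(1, 0): 0.0}                           # visited {0}, ending at 0
    for size in range(2, n + 1):
        for rest in combinations(range(1, n), size - 1):
            mask = 1 | sum(1 << c for c in rest)
            for j in rest:
                prev = mask ^ (1 << j)          # j was the last city added
                d[(mask, j)] = min(d.get((prev, k), INF) + dist[k][j]
                                   for k in range(n) if prev & (1 << k))
    full = (1 << n) - 1
    return min(d[(full, j)] for j in range(1, n))

# 4 cities on a line: the best open tour is 0-1-2-3 with cost 3
cost = held_karp([[0, 1, 10, 10], [1, 0, 1, 10],
                  [10, 1, 0, 1], [10, 10, 1, 0]])
```

A distortion limit corresponds to pruning states whose coverage vectors are "too jumpy", shrinking the 2^n state space.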


SLIDE 61

Adding a Bigram Model

  • "refined" graph: annotated with language model words
  • still dynamic programming, just a larger search space

[Figure: coverage-vector states annotated with the last target word (e.g. "... talk", "... talks", "... meeting", "... Sharon", "... Shalong"), connected by bigrams such as "with Sharon"]

space: O(2^n), time: O(2^n n^2) => space: O(2^n V^(m-1)), time: O(2^n V^(m-1) n^2) for m-gram language models


SLIDE 67

Dijkstra Algorithm

  • Dijkstra does not require acyclicity
  • instead of topological order, we use best-first order
  • but this requires superiority of the semiring
  • intuition: combination always gets worse
  • contrast with monotonicity: combination preserves order

Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.
We say K is superior if for all a, b ∈ A: a ≤ a ⊗ b, b ≤ a ⊗ b.

[Figure: extending d(u) along an edge e yields d(u) ⊗ w(e)]

Superior: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R+ ∪ {+∞}, min, +, +∞, 0);
but not (R ∪ {+∞}, min, +, +∞, 0): negative weights break a ≤ a ⊗ b.

SLIDE 70

Dijkstra Algorithm

  • keep a cut (S : V - S) where the S vertices are fixed
  • maintain a priority queue Q of the V - S vertices
  • each iteration: choose the best vertex v from Q
  • move v to S, and use d(v) to forward-update the others: d(u) ⊕= d(v) ⊗ w(v, u)

time complexity: O((V + E) log V) with a binary heap, O(V log V + E) with a Fibonacci heap

[Figure: the cut (S : V - S); the popped vertex v updates u along edge (v, u) with weight w(v, u)]
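The cut-and-queue loop above can be sketched as follows in the tropical semiring (binary-heap variant; the function name and example graph are illustrative):

```python
# Illustrative Dijkstra in the tropical semiring; superiority
# (non-negative weights) makes each popped vertex final.
import heapq

def dijkstra(adj, source):
    """adj[v]: list of outgoing (u, w) pairs with w >= 0."""
    d = {source: 0.0}
    fixed = set()                               # the set S of settled vertices
    heap = [(0.0, source)]
    while heap:
        dv, v = heapq.heappop(heap)             # best vertex of V - S
        if v in fixed:
            continue                            # stale queue entry
        fixed.add(v)                            # d(v) is now optimal
        for u, w in adj.get(v, ()):
            nd = dv + w                         # d(v) ⊗ w(v, u)
            if nd < d.get(u, float("inf")):
                d[u] = nd                       # d(u) ⊕= d(v) ⊗ w(v, u)
                heapq.heappush(heap, (nd, u))
    return d

# a small cyclic graph: Dijkstra does not need acyclicity
adj = {"s": [("a", 1.0), ("t", 10.0)], "a": [("b", 2.0)],
       "b": [("a", 0.0), ("t", 1.0)]}
d = dijkstra(adj, "s")
```

The `fixed` check implements "only update from settled vertices"; stale heap entries are skipped rather than decreased in place.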

SLIDE 77

Viterbi vs. Dijkstra

  • structural vs. algebraic constraints
  • Dijkstra is only applicable to optimization problems

[Venn diagram over monotonic optimization problems: the "acyclic: Viterbi" region and the "superior: Dijkstra" region intersect on many NLP problems; forward-backward (Inside semiring) and non-probabilistic models fall outside the superior region; cyclic FSMs/grammars fall outside the acyclic region]

SLIDE 78

What if both fail?

  • neither acyclic nor superior: generalized Bellman-Ford (CLR, 1990; Mohri, 2002)
  • or, first do strongly-connected components (SCC), which gives a DAG; use Viterbi globally on this SCC-DAG; use Bellman-Ford locally within each SCC
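The local Bellman-Ford step can be sketched as follows (tropical semiring; assumes no negative cycles, so at most V - 1 rounds of relaxation suffice):

```python
# Tropical-semiring Bellman-Ford sketch for cyclic graphs.
def bellman_ford(num_v, edges, source):
    """edges: list of (u, v, w) triples; vertices are 0 .. num_v - 1."""
    INF = float("inf")
    d = [INF] * num_v
    d[source] = 0.0
    for _ in range(num_v - 1):                  # at most V - 1 rounds
        changed = False
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w                 # d(v) ⊕= d(u) ⊗ w(u, v)
                changed = True
        if not changed:
            break                               # early convergence
    return d

# a cyclic example: the cycle 1 <-> 2 does not prevent convergence
d = bellman_ford(3, [(0, 1, 4.0), (1, 2, 2.0), (2, 1, 1.0), (0, 2, 7.0)], 0)
```

In the SCC scheme, this loop would run over the edges of one component at a time, with Viterbi ordering the components.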

SLIDE 79

What if both work?

  • full Dijkstra is slower than Viterbi: O((V + E) log V) vs. O(V + E)
  • but Dijkstra can finish as early as the target vertex is popped: a fraction α of the work, α (V + E) log V vs. V + E
  • Q: how to (magically) reduce α?

SLIDE 82

A* Search: Intuition

  • Dijkstra is "blind" about how far the target is
  • it may get "trapped" by obstacles
  • can we be more intelligent about the future?
  • idea: prioritize by the s-v distance plus a v-t estimate

[Figure: path s → v → u → t]

SLIDE 83

A* Heuristic

  • h(v): the distance from v to the target t
  • ĥ(v) must be an optimistic estimate of h(v): ĥ(v) ≤ h(v)
  • Dijkstra is the special case ĥ(v) = 1 (the identity; 0 for distances)
  • now, prioritize the queue by d(v) ⊗ ĥ(v)
  • can stop when the target gets popped -- why?
  • optimal subpaths pop earlier than non-optimal ones:
    d(v) ⊗ ĥ(v) ≤ d(v) ⊗ h(v) ≤ d(t) ≤ non-optimal paths to t

[Figure: s → v → t with exact completion cost h(v) and estimate ĥ(v)]
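The d(v) ⊗ ĥ(v) prioritization can be sketched as follows for distances (⊗ = +); the early stop on popping t relies on ĥ being optimistic (the function name and example heuristic are illustrative):

```python
# A* sketch: queue keyed by d(v) + ĥ(v); stop when the target pops.
import heapq

def astar(adj, source, target, h):
    """adj[v]: outgoing (u, w) pairs, w >= 0; h: admissible estimate."""
    d = {source: 0.0}
    done = set()
    heap = [(h(source), source)]
    while heap:
        _, v = heapq.heappop(heap)
        if v == target:
            return d[v]                         # optimal, since ĥ ≤ h
        if v in done:
            continue
        done.add(v)
        for u, w in adj.get(v, ()):
            nd = d[v] + w
            if nd < d.get(u, float("inf")):
                d[u] = nd
                heapq.heappush(heap, (nd + h(u), u))
    return float("inf")

adj = {"s": [("a", 1.0), ("t", 3.0)], "a": [("t", 1.0)]}
est = {"s": 2.0, "a": 1.0, "t": 0.0}            # an admissible ĥ (here exact)
best = astar(adj, "s", "t", lambda v: est[v])
```

With `h = lambda v: 0.0` this degenerates to Dijkstra, matching the slide's special case.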

SLIDE 85

How to design a heuristic?

  • more of an art than a science
  • basic idea: projection into a coarser space
  • cluster states: w'(U, V) = min { w(u, v) | u ∈ U, v ∈ V }
  • the exact cost in the coarser graph is an estimate for the finer graph (Raphael, 2001)

[Figure: clusters U and V of fine-graph vertices, collapsed into single vertices in the coarse graph]

SLIDE 87

Viterbi or A*?

  • A* intuition: d(t) ⊗ ĥ(t) ranks higher among the d(v) ⊗ ĥ(v)
  • can finish early if lucky
  • actually, d(t) ⊗ ĥ(t) = d(t) ⊗ h(t) = d(t) ⊗ 1 = d(t)
  • with the price of maintaining a priority queue -- O(log V) per operation
  • Q: how early? worth the price?
  • if the rank of the target is r, then A* is better when (r/V) log V < 1, i.e., r < V / log V

[Figure: Dijkstra pops d(v) values until d(t) at rank V; A* pops d(v) ⊗ ĥ(v) values and reaches d(t) at rank r]

SLIDE 93

Background: CFG and Parsing

[Figure: parse chart with goal item (S, 0, n) over the input w0 w1 ... wn-1]

SLIDE 94

(Directed) Hypergraphs

  • a generalization of graphs
  • edge => hyperedge: from several tail vertices to one head vertex
  • e = (T(e), h(e), f_e); arity |e| = |T(e)|
  • a totally-ordered weight set R
  • we borrow the ⊕ operator as the comparison
  • weight function f_e : R^|e| → R
  • generalizes the ⊗ operator in semirings

generic update: d(v) ⊕= f_e(d(u1), d(u2))
simple case: f_e(a, b) = a ⊗ b ⊗ w(e)

[Figure: hyperedge e with tails u1, u2 and head v; e.g. items Y_{i,j} and Z_{j,k} combine into X_{i,k}]
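The hyperedge update can be sketched in the Viterbi semiring with the simple case f_e(a, b) = a ⊗ b ⊗ w(e); `best_derivation` and the tiny example are illustrative:

```python
# Illustrative hyperedge update d(v) ⊕= f_e(d(u1), ..., d(uk)) in the
# Viterbi semiring, with f_e(a1, ..., ak) = a1 × ... × ak × w(e).
def best_derivation(topo_order, in_hyperedges):
    """in_hyperedges[v]: list of (tails, w(e)) pairs; leaves get a
    hyperedge with empty tails. Vertices are listed tails-before-heads."""
    d = {}
    for v in topo_order:
        d[v] = 0.0                              # 0 of the Viterbi semiring
        for tails, w in in_hyperedges.get(v, ()):
            score = w
            for u in tails:
                score *= d[u]                   # multiply in each tail's weight
            d[v] = max(d[v], score)             # ⊕ = max
    return d

# tiny example: A is built from (B, C) with rule prob 0.9,
# or from B alone with rule prob 0.2
hg = {"B": [((), 0.5)], "C": [((), 0.4)],
      "A": [(("B", "C"), 0.9), (("B",), 0.2)]}
d = best_derivation(["B", "C", "A"], hg)
```

Replacing the inner product with an arbitrary `f_e(tail_weights)` gives the general weight-function case described on the slide.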

SLIDE 96

Hypergraphs and Deduction (Nederhof, 2003)

  • hyperedge view: tails u1 : a and u2 : b derive head v : f_e(a, b)
  • deduction view: antecedents (B, i, k) and (C, k, j) derive consequent (A, i, j) by rule A → B C
  • weights: a and b combine into a × b × Pr(A → B C)

SLIDE 97

Related Formalisms

[Figure: a hyperedge e from u1, u2 to v, and its AND/OR-graph view: an AND-node over OR-nodes]

SLIDE 99

Packed Forests (Klein and Manning, 2001; Huang and Chiang, 2005)

  • a compact representation of many parses
  • by sharing common sub-derivations
  • polynomial-space encoding of an exponentially large set
  • a packed forest is a hypergraph: nodes and hyperedges

[Figure: packed forest over "0 I 1 saw 2 him 3 with 4 a 5 mirror 6"]


slide-104
SLIDE 104


Weight Functions and Semirings

[figure: weight functions vs. semirings. Graph case: an edge e from u to v carries d(u) ⊗ w(e), i.e., fe(a) = a ⊗ w(e). Hypergraph case: a hyperedge e with tails u1, ..., uk and head v carries fe(a1, ..., ak); the semiring-composed special case is fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e). Monotonicity and superiority can also be extended to general weight functions.]

slide-105
SLIDE 105


Generalizing Semiring Properties

  • monotonicity
  • semiring: a ≤ b => a ⊗ c ≤ b ⊗ c
  • general weight functions: for all f, all a1 ... ak, and all i,
    if a′i ≤ ai then f(a1, ..., a′i, ..., ak) ≤ f(a1, ..., ai, ..., ak)
  • superiority
  • semiring: a ≤ a ⊗ b and b ≤ a ⊗ b
  • general weight functions: for all f, all a1 ... ak, and all i, ai ≤ f(a1, ..., ak)
  • acyclicity
  • degenerate a hypergraph back into a graph
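These two properties can be spot-checked numerically for a candidate weight function; below is a sketch in the min-plus (tropical) setting, where "better" means numerically smaller and ⊗ is +. The helper names are illustrative, not from the talk:

```python
import itertools

def is_monotonic(f, samples, k):
    """Check: improving (lowering) any single argument never worsens f."""
    for args in itertools.product(samples, repeat=k):
        for i, a in enumerate(args):
            better = args[:i] + (a - 1.0,) + args[i + 1:]
            if f(*better) > f(*args):
                return False
    return True

def is_superior(f, samples, k):
    """Check: f(a1, ..., ak) is never better than any of its arguments."""
    for args in itertools.product(samples, repeat=k):
        if any(f(*args) < a for a in args):
            return False
    return True

cost = lambda a, b: a + b + 1.0      # semiring-composed: a (x) b (x) w(e), w(e) = 1
samples = [0.0, 1.0, 2.5]
mono = is_monotonic(cost, samples, 2)   # holds for any w(e)
sup = is_superior(cost, samples, 2)     # holds because w(e) >= 0
```

Superiority fails once negative edge costs enter, which is exactly why best-first methods need it while topological ones do not.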

slide-106
SLIDE 106


Two Dimensional Survey

[table: two dimensions: search space × traversing order]

search space \ traversing order                    topological (acyclic)   best-first (superior)
graphs with semirings (e.g., FSMs)                 Viterbi                 Dijkstra
hypergraphs with weight functions (e.g., CFGs)     Generalized Viterbi     Knuth

slide-107
SLIDE 107


Viterbi Algorithm for DAGs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each incoming edge (u, v) in E
  • use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
  • key observation: d(u) is already fixed to optimal at this time
  • time complexity: O(V + E)

[figure: edge (u, v) with weight w(u, v); update d(v) ⊕= d(u) ⊗ w(u, v)]
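The two steps above can be sketched as follows (a minimal min-plus instance, so ⊕ is min and ⊗ is +; the toy graph is invented for illustration):

```python
from graphlib import TopologicalSorter

INF = float("inf")

def viterbi_dag(vertices, edges, source):
    """edges: list of (u, v, w).  Min-plus Viterbi: d(v) (+)= d(u) (x) w(u, v)."""
    preds = {v: [] for v in vertices}
    for u, v, w in edges:
        preds[v].append((u, w))
    d = {v: INF for v in vertices}
    d[source] = 0.0
    # 1. topological sort (TopologicalSorter maps each node to its predecessors)
    order = TopologicalSorter({v: [u for u, _ in preds[v]] for v in vertices})
    # 2. visit vertices in sorted order; d(u) is already optimal when used
    for v in order.static_order():
        for u, w in preds[v]:
            d[v] = min(d[v], d[u] + w)      # d(v) ⊕= d(u) ⊗ w(u, v)
    return d

d = viterbi_dag("sabt", [("s", "a", 1.0), ("s", "b", 4.0), ("a", "b", 2.0),
                         ("a", "t", 6.0), ("b", "t", 1.0)], "s")
```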

slide-108
SLIDE 108


Viterbi Algorithm for DAHs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each incoming hyperedge e = ((u1, ..., u|e|), v, fe)
  • use the d(ui)’s to update d(v)
  • key observation: the d(ui)’s are fixed to optimal at this time
  • time complexity: O(V + E) (assuming constant arity)

[figure: hyperedge e with tails u1, u2 and head v; update d(v) ⊕= fe(d(u1), ..., d(u|e|))]
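A hedged sketch of the hypergraph version (min-plus weights again; for brevity the topological order is passed in rather than computed, and axioms are modeled as nullary hyperedges with |e| = 0):

```python
INF = float("inf")

def viterbi_dah(order, hyperedges):
    """order: vertices in topological order (tails before heads).
    hyperedges: list of (tails, head, fe)."""
    incoming = {}
    for tails, head, fe in hyperedges:
        incoming.setdefault(head, []).append((tails, fe))
    d = {v: INF for v in order}
    for v in order:
        # the d(ui)'s of every incoming hyperedge are already optimal here
        for tails, fe in incoming.get(v, []):
            d[v] = min(d[v], fe(*(d[u] for u in tails)))  # d(v) ⊕= fe(d(u1), ..., d(u|e|))
    return d

edges = [((), "B", lambda: 1.0),               # axiom: nullary hyperedge
         ((), "C", lambda: 2.0),
         (("B", "C"), "A", lambda a, b: a + b + 0.5)]
d = viterbi_dah(["B", "C", "A"], edges)
```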


slide-113
SLIDE 113


Example: CKY Parsing

  • parsing with CFGs in Chomsky Normal Form (CNF)
  • typical instance of the generalized Viterbi for DAHs
  • many variants of CKY ~ various topological ordering

[figure: three topological orderings of the CKY hypergraph toward the goal item (S, 0, n): bottom-up, left-to-right, right-to-left; complexity O(n^3 |P|)]
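As an illustration, a minimal bottom-up CKY in the min-plus setting (the toy grammar and sentence are invented; `best[i, j, A]` plays the role of d((A, i, j))):

```python
INF = float("inf")

def cky(words, lexicon, binary):
    """lexicon: {(A, word): cost}; binary: {(A, B, C): cost} for CNF rules A -> B C.
    Returns best[i, j, A] = cost of the best parse of w_i ... w_{j-1} rooted at A."""
    n = len(words)
    best = {}
    for i, w in enumerate(words):                       # width-1 spans (lexical items)
        for (A, word), c in lexicon.items():
            if word == w:
                best[i, i + 1, A] = min(best.get((i, i + 1, A), INF), c)
    for width in range(2, n + 1):                       # bottom-up over span width
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                   # split point
                for (A, B, C), c in binary.items():
                    cand = (best.get((i, k, B), INF)
                            + best.get((k, j, C), INF)
                            + c)                        # d(v) ⊕= d(u1) ⊗ d(u2) ⊗ w(e)
                    if cand < best.get((i, j, A), INF):
                        best[i, j, A] = cand
    return best

words = ["I", "saw", "him"]
lexicon = {("NP", "I"): 0.0, ("V", "saw"): 0.0, ("NP", "him"): 0.0}
binary = {("VP", "V", "NP"): 1.0, ("S", "NP", "VP"): 1.0}
best = cky(words, lexicon, binary)
```

The three nested loops over (width, i, k) plus the rule loop give the O(n^3 |P|) bound from the slide.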

slide-114
SLIDE 114


Example: Syntax-based MT

  • synchronous context-free grammars (SCFGs)
  • context-free grammar in two dimensions
  • generating pairs of strings/trees simultaneously
  • co-indexed nonterminals are further rewritten as a unit

[figure: a synchronized Chinese-English tree pair for "yu Shalong juxing le huitan" / "held a meeting with Sharon"]

VP → PP(1) VP(2), VP(2) PP(1)
VP → juxing le huitan, held a meeting
PP → yu Shalong, with Sharon


slide-118
SLIDE 118


Translation as Parsing

[figure: decoding "yu Shalong juxing le huitan": PP1,3 yields "with Sharon", VP3,6 yields "held a talk", and VP1,6 combines them into "held a talk with Sharon"]

VP → PP(1) VP(2), VP(2) PP(1)
VP → juxing le huitan, held a meeting
PP → yu Shalong, with Sharon

complexity: same as CKY parsing -- O(n^3)


slide-121
SLIDE 121


Adding a Bigram Model

[figure: items refined with target-side boundary words. PP1,3 candidates: "with ... Sharon", "along ... Sharon", "with ... Shalong"; VP3,6 candidates: "held ... talk", "held ... meeting", "hold ... talks". Combining VP3,6 ("held ... talk") with PP1,3 ("with ... Sharon") scores the bigram at the boundary and yields VP1,6 ("held ... Sharon")]

complexity: O(n^3 V^{4(m-1)}) for an m-gram language model

slide-122
SLIDE 122


Two Dimensional Survey

[table: two dimensions: search space × traversing order]

search space \ traversing order                    topological (acyclic)   best-first (superior)
graphs with semirings (e.g., FSMs)                 Viterbi                 Dijkstra
hypergraphs with weight functions (e.g., CFGs)     Generalized Viterbi     Knuth



slide-126
SLIDE 126


Forward Variant for DAHs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each outgoing hyperedge e = ((u1, ..., u|e|), h(e), fe)
  • if the d(ui)’s have all been fixed to optimal
  • use the d(ui)’s to update d(h(e))
  • time complexity: O(V + E)

[figure: vertex v = ui is one tail of an outgoing hyperedge e with head h(e)]

Q: how to avoid repeatedly checking whether all tails are fixed? Maintain a counter r[e] for each hyperedge e: how many tails are yet to be fixed; fire the hyperedge only when r[e] = 0.
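The counter trick can be sketched like this (min-plus again; with the counters, each hyperedge fires exactly once, when its last tail is fixed):

```python
INF = float("inf")

def forward_viterbi_dah(order, hyperedges, axioms):
    """order: vertices in topological order; axioms: {v: initial cost}.
    r[e] counts tails not yet fixed; e fires only when r[e] reaches 0."""
    d = {v: axioms.get(v, INF) for v in order}
    r = [len(tails) for tails, _, _ in hyperedges]   # remaining unfixed tails
    out = {}                                         # vertex -> hyperedges it feeds
    for idx, (tails, _, _) in enumerate(hyperedges):
        for u in tails:
            out.setdefault(u, []).append(idx)
    for v in order:                                  # v is fixed to optimal here
        for idx in out.get(v, []):
            r[idx] -= 1
            if r[idx] == 0:                          # all tails fixed: fire e
                tails, head, fe = hyperedges[idx]
                d[head] = min(d[head], fe(*(d[u] for u in tails)))
    return d

edges = [(("B", "C"), "A", lambda a, b: a + b + 0.5)]
d = forward_viterbi_dah(["B", "C", "A"], edges, {"B": 1.0, "C": 2.0})
```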


slide-129
SLIDE 129


Dijkstra Algorithm

  • keep a cut (S : V - S) where S vertices are fixed
  • maintain a priority queue Q of V - S vertices
  • each iteration choose the best vertex v from Q
  • move v to S, and use d(v) to forward-update others

[figure: cut (S : V - S) with source s in S; the chosen vertex v is moved to S, then each edge (v, u) forward-updates d(u) ⊕= d(v) ⊗ w(v, u)]

time complexity: O((V+E) log V) (binary heap); O(V log V + E) (Fibonacci heap)
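A compact sketch of Dijkstra in this notation (min-plus, nonnegative weights for superiority; the lazy-deletion heap shown here re-inserts instead of decrease-key, matching the O((V+E) log V) binary-heap bound):

```python
import heapq

def dijkstra(vertices, edges, source):
    """edges: list of (u, v, w) with w >= 0 (superiority)."""
    adj = {v: [] for v in vertices}
    for u, v, w in edges:
        adj[u].append((v, w))
    d = {v: float("inf") for v in vertices}
    d[source] = 0.0
    queue = [(0.0, source)]                 # priority queue over V - S
    fixed = set()                           # the set S of fixed vertices
    while queue:
        dv, v = heapq.heappop(queue)        # best vertex in V - S
        if v in fixed:
            continue                        # stale (lazily deleted) entry
        fixed.add(v)                        # move v to S ...
        for u, w in adj[v]:                 # ... and forward-update its neighbors
            if dv + w < d[u]:               # d(u) ⊕= d(v) ⊗ w(v, u)
                d[u] = dv + w
                heapq.heappush(queue, (d[u], u))
    return d

d = dijkstra("sabt", [("s", "a", 1.0), ("s", "b", 4.0), ("a", "b", 2.0),
                      ("a", "t", 6.0), ("b", "t", 1.0)], "s")
```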


slide-132
SLIDE 132


Knuth (1977) Algorithm

  • keep a cut (S : V - S) where S vertices are fixed
  • maintain a priority queue Q of V - S vertices
  • each iteration choose the best vertex v from Q
  • move v to S, and use d(v) to forward-update others

[figure: cut (S : V - S) with source s in S; when v is chosen and moved to S, each outgoing hyperedge e whose tails are all in S updates its head h(e) via fe]

time complexity: O((V+E) log V) (binary heap); O(V log V + E) (Fibonacci heap)
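Knuth's generalization can be sketched by replacing edge relaxation with hyperedge firing (a deliberately naive scan over all hyperedges for clarity; a real implementation would index hyperedges by tail and keep counters, as in the forward variant). Superior, monotonic weight functions are assumed:

```python
import heapq

def knuth(vertices, hyperedges, axioms):
    """hyperedges: list of (tails, head, fe); axioms: {v: cost}.
    Assumes fe is monotonic and superior, e.g. fe(a, b) = a + b + w, w >= 0."""
    d = dict.fromkeys(vertices, float("inf"))
    d.update(axioms)
    queue = [(c, v) for v, c in axioms.items()]
    heapq.heapify(queue)
    fixed = set()                              # the set S
    while queue:
        dv, v = heapq.heappop(queue)           # best vertex in V - S
        if v in fixed:
            continue
        fixed.add(v)                           # move v into S; d(v) is now final
        for tails, head, fe in hyperedges:
            if head not in fixed and v in tails and all(u in fixed for u in tails):
                cand = fe(*(d[u] for u in tails))
                if cand < d[head]:
                    d[head] = cand
                    heapq.heappush(queue, (cand, head))
    return d

d = knuth("BCA", [(("B", "C"), "A", lambda a, b: a + b + 0.5)],
          {"B": 1.0, "C": 2.0})
```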


slide-134
SLIDE 134


Example: Best-First/A* Parsing

  • Knuth for parsing: best-first parsing (Caraballo & Charniak, 1998)
  • further speed-up: use A* heuristics
  • significant speed-ups shown with carefully designed heuristic functions (Klein and Manning, 2003)
  • heuristic function: an estimate of the outside cost

[figure: the goal item (S, 0, n)]



slide-137
SLIDE 137


Outside Cost in Hypergraph

  • outside cost: what is yet to be paid to reach the goal
  • let’s consider only the semiring-composed case
  • and only acyclic hypergraphs
  • after computing d(v) for all v bottom-up
  • run a backward Viterbi pass top-down (outside-in)

h(S0,n) = 1̄ (the identity of ⊗)
h(v) ⊕= h(u) ⊗ w(e) ⊗ d(v′)

[figure: hyperedge e with head u and tails v, v′; graph analogy: on a path s → v → t, d(v) is the inside cost from s to v and h(v) the outside cost from v to t]

Q: d(v) ⊗ h(v) = ?
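In the semiring-composed acyclic case the backward pass looks like this (a min-plus sketch, so 1̄ = 0; d(v) ⊗ h(v), the merit of v, is then the cost of the best derivation passing through v):

```python
INF = float("inf")

def outside(order, hyperedges, d, goal):
    """order: topological (bottom-up); hyperedges: (tails, head, w), min-plus.
    h(goal) = 0; h(v) ⊕= h(head) ⊗ w(e) ⊗ (the d's of the sibling tails)."""
    h = {v: INF for v in order}
    h[goal] = 0.0                                # h(S0,n) = 1-bar
    for head in reversed(order):                 # top-down (outside-in)
        for tails, hd, w in hyperedges:
            if hd != head:
                continue
            for i, v in enumerate(tails):
                siblings = sum(d[u] for j, u in enumerate(tails) if j != i)
                h[v] = min(h[v], h[head] + w + siblings)
    return h

edges = [(("B", "C"), "A", 0.5)]
d = {"B": 1.0, "C": 2.0, "A": 3.5}               # inside costs from the Viterbi pass
h = outside(["B", "C", "A"], edges, d, "A")
# merit d(v) + h(v): cost of the best derivation through v
```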


slide-143
SLIDE 143


Projection-based Heuristics

  • how to guess? project onto a coarser-grained space
  • and parse with the coarser grammar
  • use the outside cost of the coarser item as the heuristic

(Klein and Manning, 2003)

ĥ(VBD2,3) = h′(V2,3)

slide-144
SLIDE 144


Analogy with Graphs



slide-146
SLIDE 146


More on Coarse-to-Fine

  • multilevel coarse-to-fine A*
  • heuristic = exact outside cost from the previous (coarser) stage
  • ĥi(v) = hi-1(proj i-1(v))
  • e.g., with projections VBD > V > X: ĥi(VBD1,5) = hi-1(V1,5); ĥi-1(V1,5) = hi-2(X1,5)
  • multilevel coarse-to-fine Viterbi with beam search
  • Viterbi + beam pruning in each stage
  • prune according to merit: d(v) ⊗ h(v) ⊘ d(TOP)
  • hard to derive a provably correct threshold
  • in practice: a preset threshold works well



slide-154
SLIDE 154


Same Picture Again

[figure: Venn diagram of monotonic optimization problems: acyclic (Viterbi) vs. superior (Knuth), with many NLP problems, e.g. PCFG parsing with CNF, in the intersection. Other labels: Inside-Outside Alg. (Inside semiring), non-prob. (discriminative) parsing, cyclic grammars, generalized Bellman-Ford (open)]

slide-155
SLIDE 155


Take Home Message

  • Dynamic Programming is cool, easy, and universal!
  • two frameworks and two types of algorithms
  • monotonicity; acyclicity and/or superiority
  • topological (Viterbi) vs. best-first style (Dijkstra/Knuth/A*)
  • when to choose which: A* can finish early if lucky
  • graph (lattice) vs. hypergraph (forest)
  • incremental, finite-state vs. branching, context-free
  • covered many typical NLP applications
  • a better understanding of theory helps in practice


slide-156
SLIDE 156

THE END - Thanks!


final slides will be available on my website.


Questions? Comments?