SLIDE 1

Advanced Dynamic Programming in CL:
Theory, Algorithms, and Applications

Liang Huang
University of Pennsylvania

[Figure: goal item (S, 0, n) spanning the input w0 w1 ... wn-1]


SLIDE 5

Liang Huang (Penn) Dynamic Programming

A Little Bit of History...

  • Who invented Dynamic Programming? And when was it invented?
  • R. Bellman (1940s-50s)
  • A. Viterbi (1967)
  • E. Dijkstra (1959)
  • Hart, Nilsson, and Raphael (1968): Dijkstra => A* Algorithm
  • D. Knuth (1977): Dijkstra on Grammars (Hypergraphs)
  • A. Turing

[Photos: Andrew Viterbi, Richard Bellman]
SLIDE 6

Dynamic Programming

  • Dynamic Programming is everywhere in NLP
  • Viterbi Algorithm for Hidden Markov Models
  • CKY Algorithm for Parsing and Machine Translation
  • Forward-Backward and Inside-Outside Algorithms
  • Also everywhere in AI/ML
  • Reinforcement Learning, Planning (POMDP)
  • AI Search: Uniform-cost, A*, etc.
  • This tutorial: a unified theoretical view of DP
  • Focusing on Optimization Problems

SLIDE 7

Review: DP Basics

  • DP = Divide-and-Conquer + Two Principles:
  • [required] Optimal Substructure Property
  • [recommended] Sharing of Common Subproblems
  • Structure of the Search Space
  • Incremental => Graph: Knapsack, Edit Distance, Sequence Alignment
  • Branching => Hypergraph: Matrix-Chain, Polygon Triangulation, Optimal BST

SLIDE 8

Two Dimensional Survey

Two dimensions: traversing order (rows) × search space (columns)

                        graphs with semirings (e.g., FSMs)   hypergraphs with weight functions (e.g., CFGs)
topological (acyclic)   Viterbi                              Generalized Viterbi
best-first (superior)   Dijkstra                             Knuth

SLIDE 9

Graphs in NLP

[Figures: a part-of-speech tagging graph; a lattice in speech]

SLIDE 10

Semirings on Graphs

  • in a weighted graph, we need two operators:
  • extension (multiplicative ⊗) and summary (additive ⊕)
  • the weight of a path is the product of its edge weights:
    d(π1) = ⊗_{ei ∈ π1} w(ei) = w(e1) ⊗ w(e2) ⊗ w(e3)
  • the weight of a vertex is the summary of its path weights:
    d(t) = ⊕_{πi} w(πi) = w(π1) ⊕ w(π2) ⊕ · · ·

[Figure: a path s → u → ... → v → t with edges e1, e2, e3]

SLIDE 15

Semiring Definitions

A monoid is a triple (A, ⊗, 1) where

  • 1. ⊗ is a closed associative binary operator on the set A,
  • 2. 1 is the identity element for ⊗, i.e., for all a ∈ A, a ⊗ 1 = 1 ⊗ a = a.

A monoid is commutative if ⊗ is commutative.

A semiring is a 5-tuple R = (A, ⊕, ⊗, 0, 1) such that

  • 1. (A, ⊕, 0) is a commutative monoid.
  • 2. (A, ⊗, 1) is a monoid.
  • 3. ⊗ distributes over ⊕: for all a, b, c in A,
    (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c), c ⊗ (a ⊕ b) = (c ⊗ a) ⊕ (c ⊗ b).
  • 4. 0 is an annihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0 = 0.

Example monoids: ([0, 1], +, 0), ([0, 1], ×, 1), ([0, 1], max, 0)
Example semirings: ([0, 1], max, ×, 0, 1), ([0, 1], +, ×, 0, 1)
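The definitions above can be spot-checked mechanically. A minimal sketch in Python; the `Semiring` tuple and `check_semiring` helper are illustrative names, not from any library:

```python
# Illustrative sketch: semirings as (plus, times, zero, one) tuples.
from collections import namedtuple

Semiring = namedtuple("Semiring", "plus times zero one")

# ([0, 1], max, ×, 0, 1): the Viterbi semiring
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
# (R+ ∪ {+∞}, min, +, +∞, 0): the tropical semiring
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)

def check_semiring(K, samples):
    """Spot-check the semiring axioms on a few sample values."""
    for a in samples:
        assert K.plus(a, K.zero) == a                       # 0 is the identity for ⊕
        assert K.times(a, K.one) == a == K.times(K.one, a)  # 1 is the identity for ⊗
        assert K.times(a, K.zero) == K.zero == K.times(K.zero, a)  # 0 annihilates
        for b in samples:
            for c in samples:
                lhs = K.times(K.plus(a, b), c)              # (a ⊕ b) ⊗ c
                rhs = K.plus(K.times(a, c), K.times(b, c))  # (a ⊗ c) ⊕ (b ⊗ c)
                assert lhs == rhs                           # distributivity
    return True

ok = (check_semiring(VITERBI, [0.0, 0.25, 0.5, 1.0])
      and check_semiring(TROPICAL, [0.0, 1.5, 3.0, float("inf")]))
```

The same checker also rejects non-examples, e.g. a set that is not closed under the operator.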

SLIDE 16

Examples

Semiring   Set          ⊕    ⊗   0    1   intuition/application
Boolean    {0, 1}       ∨    ∧   0    1   logical deduction, recognition
Viterbi    [0, 1]       max  ×   0    1   prob. of the best derivation
Inside     R+ ∪ {+∞}    +    ×   0    1   prob. of a string
Real       R ∪ {+∞}     min  +   +∞   0   shortest-distance
Tropical   R+ ∪ {+∞}    min  +   +∞   0   shortest-distance with non-negative weights
Counting   N            +    ×   0    1   number of paths

SLIDE 20

Ordering

  • idempotent
  • comparison
  • examples: boolean, viterbi, tropical, real, ...
  • total-order for optimization problems
  • examples: all of the above

A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.

(a ≤ b) ⇔ (a ⊕ b = a) defines a partial ordering.

A semiring is totally-ordered if ⊕ defines a total ordering.

Examples: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R+ ∪ {+∞}, min, +, +∞, 0), (R ∪ {+∞}, min, +, +∞, 0)

SLIDE 29

Monotonicity

  • monotonicity
  • optimal substructure in dynamic programming
  • idempotent => monotone (from distributivity):
  • (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c); if a ≤ b, then a ⊗ c = (a ⊗ c) ⊕ (b ⊗ c)
  • by def. of comparison, a ⊗ c ≤ b ⊗ c

Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.
We say K is monotonic if for all a, b, c ∈ A:
(a ≤ b) ⇒ (a ⊗ c ≤ b ⊗ c)
(a ≤ b) ⇒ (c ⊗ a ≤ c ⊗ b)

[Figure: items B: b and C: c combine into A: b ⊗ c; replacing b with b' ≤ b yields A: b' ⊗ c ≤ b ⊗ c]
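The idempotent-implies-monotone argument can be spot-checked numerically, e.g. in the tropical semiring (min, +), with ≤ defined by the comparison a ⊕ b = a:

```python
# Spot-check of idempotence => monotonicity in the tropical semiring.
import itertools

plus = min
times = lambda a, b: a + b
leq = lambda a, b: plus(a, b) == a     # comparison ordering: a ≤ b iff a ⊕ b = a

samples = [0.0, 1.0, 2.5, 7.0]
for a, b, c in itertools.product(samples, repeat=3):
    assert plus(a, a) == a                        # idempotent
    if leq(a, b):                                 # a ≤ b ...
        assert leq(times(a, c), times(b, c))      # ... implies a ⊗ c ≤ b ⊗ c
```

This is a numeric sanity check on sample values, not a proof; the proof is the distributivity argument on the slide.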

SLIDE 30

DP on Graphs

  • optimization problems on graphs => generic shortest-path problem
  • weighted directed graph G = (V, E) with a function w that assigns each edge a weight from a semiring
  • compute the best weight of the target vertex t
  • generic update along edge (u, v):
    d(v) ⊕= d(u) ⊗ w(u, v), i.e., d(v) ← d(v) ⊕ (d(u) ⊗ w(u, v))
  • how to avoid cyclic updates?
  • only update when d(u) is fixed


SLIDE 32

Viterbi Algorithm for DAGs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each incoming edge (u, v) in E, use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
  • key observation: d(u) is fixed to optimal at this time
  • time complexity: O(V + E)
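The two steps above can be sketched as follows, here instantiated in the tropical (min, +) semiring; `viterbi_dag` and the example graph are illustrative:

```python
# Illustrative generic Viterbi for DAGs in the tropical (min, +) semiring.
def viterbi_dag(topo_order, in_edges, source):
    """in_edges[v]: list of incoming (u, w) pairs; topo_order is assumed
    to be a valid topological order of the vertices."""
    d = {v: float("inf") for v in topo_order}   # 0 of the semiring
    d[source] = 0.0                             # 1 of the semiring
    for v in topo_order:                        # d(u) is already fixed here
        for u, w in in_edges.get(v, ()):
            d[v] = min(d[v], d[u] + w)          # d(v) ⊕= d(u) ⊗ w(u, v)
    return d

# tiny example DAG: s -> a -> b -> t, plus shortcuts s -> b and a -> t
order = ["s", "a", "b", "t"]
in_edges = {"a": [("s", 1.0)], "b": [("s", 4.0), ("a", 2.0)],
            "t": [("b", 1.0), ("a", 5.0)]}
d = viterbi_dag(order, in_edges, "s")
```

Swapping min/+ for another monotonic semiring's ⊕/⊗ changes what is computed without changing the traversal.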

SLIDE 33

Variant 1: forward-update

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each outgoing edge (v, u) in E, use d(v) to update d(u): d(u) ⊕= d(v) ⊗ w(v, u)
  • key observation: d(v) is fixed to optimal at this time
  • time complexity: O(V + E)

SLIDE 40

Examples

  • [Number of Paths in a DAG]
  • just use the counting semiring (N, +, ×, 0, 1)
  • note: this is not an optimization problem!
  • [Longest Path in a DAG]
  • just use the semiring (R ∪ {−∞}, max, +, −∞, 0)
  • [Part-of-Speech Tagging with a Hidden Markov Model]
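Both examples run on the same generic update with different semirings plugged in; a small sketch (`run_semiring` and the diamond graph are illustrative):

```python
# Same generic update, two semirings: counting (N, +, ×, 0, 1) and
# longest path (R ∪ {−∞}, max, +, −∞, 0).
def run_semiring(topo_order, in_edges, source, plus, times, zero, one):
    d = {v: zero for v in topo_order}
    d[source] = one
    for v in topo_order:
        for u, w in in_edges.get(v, ()):
            d[v] = plus(d[v], times(d[u], w))   # d(v) ⊕= d(u) ⊗ w(u, v)
    return d

order = ["s", "a", "b", "t"]                    # a diamond DAG
edges = {"a": [("s", 1.0)], "b": [("s", 3.0)], "t": [("a", 2.0), ("b", 1.0)]}
ones = {v: [(u, 1) for u, _ in es] for v, es in edges.items()}  # unit weights

# number of s-to-t paths: 2 (s-a-t and s-b-t)
npaths = run_semiring(order, ones, "s", lambda a, b: a + b,
                      lambda a, b: a * b, 0, 1)
# longest s-to-t path: max(1+2, 3+1)
longest = run_semiring(order, edges, "s", max, lambda a, b: a + b,
                       float("-inf"), 0.0)
```

Note that path counting uses weight 1 on every edge, since the counting semiring multiplies edge weights along a path.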

SLIDE 41

Example: Speech Alignment

time complexity: O(n^2); also used in: edit distance, biological sequence alignment
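The O(n^2) recurrence in its edit-distance form looks like this (a standard sketch, not code from the tutorial); each cell (i, j) is updated from three incoming "edges":

```python
# Standard O(n^2) edit-distance DP: substitution, deletion, insertion.
def edit_distance(x, y):
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j                  # boundary: all inserts/deletes
            else:
                sub = d[i - 1][j - 1] + (x[i - 1] != y[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m]
```

Replacing the unit costs with acoustic or substitution scores gives the alignment variants the slide mentions.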

SLIDE 42

Example: Word Alignment

  • key difference: reorderings in translation!
  • sequence/speech alignment is always monotonic
  • complexity under an HMM: word alignment is O(n^3)
  • for every (i, j), enumerate all (i-1, k)
  • sequence alignment is O(n^2)

[Figure: aligning "I love you ." with "Je t'aime ."; state (i, j) is reached from states (i-1, k)]

SLIDE 48

Chinese Word Segmentation

下 雨 天 地 面 积 水
xia yu tian di mian ji shui

  • 民主 (min-zhu, character gloss "people-dominate") => "democracy"
  • 江泽民 主席 (jiang-ze-min zhu-xi, character gloss "... - ... - people dominate-podium") => "President Jiang Zemin"
  • this was 5 years ago; now Google is good at segmentation!
  • segmentation as graph search

SLIDE 55

Huang and Chiang, Forest Rescoring

Phrase-based Decoding

与 沙龙 举行 了 会谈
yu Shalong juxing le huitan
"held a talk with Sharon"

[Figure: decoding hypotheses such as "held a talk", each with a coverage vector over the source words]

source-side: coverage vector; target-side: grow hypotheses strictly left-to-right

space: O(2^n), time: O(2^n n^2) -- cf. traveling salesman problem

SLIDE 56

Traveling Salesman Problem & MT

  • a classical NP-hard problem
  • goal: visit each city once and only once
  • exponential-time dynamic programming (Held and Karp, 1962; Knight, 1999)
  • state: cities visited so far (bit-vector)
  • search in this O(2^n) transformed graph
  • MT: each city is a source-language word
  • restrictions in reordering can reduce complexity => distortion limit => syntax-based MT
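The bit-vector DP above is the Held-Karp recursion; a small sketch, assuming an open path that starts at city 0 (a stand-in for left-to-right decoding; `held_karp` is an illustrative name):

```python
# Held-Karp sketch: DP state = (bit-vector of visited cities, last city),
# giving O(2^n n^2) time instead of O(n!) enumeration.
from itertools import combinations

def held_karp(dist):
    """dist[i][j]: cost of going from city i to city j; returns the
    cheapest path visiting every city exactly once, starting at 0."""
    n = len(dist)
    INF = float("inf")
    d = {(1, 0): 0.0}                           # visited {0}, ending at 0
    for size in range(2, n + 1):
        for rest in combinations(range(1, n), size - 1):
            mask = 1 | sum(1 << c for c in rest)
            for j in rest:
                prev = mask ^ (1 << j)          # j was the last city added
                d[(mask, j)] = min(d.get((prev, k), INF) + dist[k][j]
                                   for k in range(n) if prev & (1 << k))
    full = (1 << n) - 1
    return min(d[(full, j)] for j in range(1, n))

# 4 cities on a line: the best open tour is 0-1-2-3 with cost 3
cost = held_karp([[0, 1, 10, 10], [1, 0, 1, 10],
                  [10, 1, 0, 1], [10, 10, 1, 0]])
```

A distortion limit corresponds to pruning states whose coverage vectors are "too jumpy", shrinking the 2^n state space.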


SLIDE 61

Adding a Bigram Model

  • "refined" graph: annotated with language model words
  • still dynamic programming, just a larger search space

[Figure: coverage-vector states annotated with the last target word (e.g. "... talk", "... talks", "... meeting", "... Sharon", "... Shalong"), connected by bigrams such as "with Sharon"]

space: O(2^n), time: O(2^n n^2) => space: O(2^n V^(m-1)), time: O(2^n V^(m-1) n^2) for m-gram language models


SLIDE 67

Dijkstra Algorithm

  • Dijkstra does not require acyclicity
  • instead of topological order, we use best-first order
  • but this requires superiority of the semiring
  • intuition: combination always gets worse
  • contrast with monotonicity: combination preserves order

Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A.
We say K is superior if for all a, b ∈ A: a ≤ a ⊗ b, b ≤ a ⊗ b.

[Figure: extending d(u) along an edge e yields d(u) ⊗ w(e)]

Superior: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R+ ∪ {+∞}, min, +, +∞, 0);
but not (R ∪ {+∞}, min, +, +∞, 0): negative weights break a ≤ a ⊗ b.

SLIDE 70

Dijkstra Algorithm

  • keep a cut (S : V - S) where the S vertices are fixed
  • maintain a priority queue Q of the V - S vertices
  • each iteration: choose the best vertex v from Q
  • move v to S, and use d(v) to forward-update the others: d(u) ⊕= d(v) ⊗ w(v, u)

time complexity: O((V + E) log V) with a binary heap, O(V log V + E) with a Fibonacci heap

[Figure: the cut (S : V - S); the popped vertex v updates u along edge (v, u) with weight w(v, u)]
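The cut-and-queue loop above can be sketched as follows in the tropical semiring (binary-heap variant; the function name and example graph are illustrative):

```python
# Illustrative Dijkstra in the tropical semiring; superiority
# (non-negative weights) makes each popped vertex final.
import heapq

def dijkstra(adj, source):
    """adj[v]: list of outgoing (u, w) pairs with w >= 0."""
    d = {source: 0.0}
    fixed = set()                               # the set S of settled vertices
    heap = [(0.0, source)]
    while heap:
        dv, v = heapq.heappop(heap)             # best vertex of V - S
        if v in fixed:
            continue                            # stale queue entry
        fixed.add(v)                            # d(v) is now optimal
        for u, w in adj.get(v, ()):
            nd = dv + w                         # d(v) ⊗ w(v, u)
            if nd < d.get(u, float("inf")):
                d[u] = nd                       # d(u) ⊕= d(v) ⊗ w(v, u)
                heapq.heappush(heap, (nd, u))
    return d

# a small cyclic graph: Dijkstra does not need acyclicity
adj = {"s": [("a", 1.0), ("t", 10.0)], "a": [("b", 2.0)],
       "b": [("a", 0.0), ("t", 1.0)]}
d = dijkstra(adj, "s")
```

The `fixed` check implements "only update from settled vertices"; stale heap entries are skipped rather than decreased in place.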

SLIDE 77

Viterbi vs. Dijkstra

  • structural vs. algebraic constraints
  • Dijkstra is only applicable to optimization problems

[Venn diagram over monotonic optimization problems: the "acyclic: Viterbi" region and the "superior: Dijkstra" region intersect on many NLP problems; forward-backward (Inside semiring) and non-probabilistic models fall outside the superior region; cyclic FSMs/grammars fall outside the acyclic region]

SLIDE 78

What if both fail?

  • neither acyclic nor superior: generalized Bellman-Ford (CLR, 1990; Mohri, 2002)
  • or, first do strongly-connected components (SCC), which gives a DAG; use Viterbi globally on this SCC-DAG; use Bellman-Ford locally within each SCC
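The local Bellman-Ford step can be sketched as follows (tropical semiring; assumes no negative cycles, so at most V - 1 rounds of relaxation suffice):

```python
# Tropical-semiring Bellman-Ford sketch for cyclic graphs.
def bellman_ford(num_v, edges, source):
    """edges: list of (u, v, w) triples; vertices are 0 .. num_v - 1."""
    INF = float("inf")
    d = [INF] * num_v
    d[source] = 0.0
    for _ in range(num_v - 1):                  # at most V - 1 rounds
        changed = False
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w                 # d(v) ⊕= d(u) ⊗ w(u, v)
                changed = True
        if not changed:
            break                               # early convergence
    return d

# a cyclic example: the cycle 1 <-> 2 does not prevent convergence
d = bellman_ford(3, [(0, 1, 4.0), (1, 2, 2.0), (2, 1, 1.0), (0, 2, 7.0)], 0)
```

In the SCC scheme, this loop would run over the edges of one component at a time, with Viterbi ordering the components.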

SLIDE 79

What if both work?

  • full Dijkstra is slower than Viterbi: O((V + E) log V) vs. O(V + E)
  • but Dijkstra can finish as early as the target vertex is popped: a fraction α of the work, α (V + E) log V vs. V + E
  • Q: how to (magically) reduce α?

SLIDE 82

A* Search: Intuition

  • Dijkstra is "blind" about how far the target is
  • it may get "trapped" by obstacles
  • can we be more intelligent about the future?
  • idea: prioritize by the s-v distance plus a v-t estimate

[Figure: path s → v → u → t]

SLIDE 83

A* Heuristic

  • h(v): the distance from v to the target t
  • ĥ(v) must be an optimistic estimate of h(v): ĥ(v) ≤ h(v)
  • Dijkstra is the special case ĥ(v) = 1 (the identity; 0 for distances)
  • now, prioritize the queue by d(v) ⊗ ĥ(v)
  • can stop when the target gets popped -- why?
  • optimal subpaths pop earlier than non-optimal ones:
    d(v) ⊗ ĥ(v) ≤ d(v) ⊗ h(v) ≤ d(t) ≤ non-optimal paths to t

[Figure: s → v → t with exact completion cost h(v) and estimate ĥ(v)]
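The d(v) ⊗ ĥ(v) prioritization can be sketched as follows for distances (⊗ = +); the early stop on popping t relies on ĥ being optimistic (the function name and example heuristic are illustrative):

```python
# A* sketch: queue keyed by d(v) + ĥ(v); stop when the target pops.
import heapq

def astar(adj, source, target, h):
    """adj[v]: outgoing (u, w) pairs, w >= 0; h: admissible estimate."""
    d = {source: 0.0}
    done = set()
    heap = [(h(source), source)]
    while heap:
        _, v = heapq.heappop(heap)
        if v == target:
            return d[v]                         # optimal, since ĥ ≤ h
        if v in done:
            continue
        done.add(v)
        for u, w in adj.get(v, ()):
            nd = d[v] + w
            if nd < d.get(u, float("inf")):
                d[u] = nd
                heapq.heappush(heap, (nd + h(u), u))
    return float("inf")

adj = {"s": [("a", 1.0), ("t", 3.0)], "a": [("t", 1.0)]}
est = {"s": 2.0, "a": 1.0, "t": 0.0}            # an admissible ĥ (here exact)
best = astar(adj, "s", "t", lambda v: est[v])
```

With `h = lambda v: 0.0` this degenerates to Dijkstra, matching the slide's special case.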

SLIDE 85

How to design a heuristic?

  • more of an art than a science
  • basic idea: projection into a coarser space
  • cluster states: w'(U, V) = min { w(u, v) | u ∈ U, v ∈ V }
  • the exact cost in the coarser graph is an estimate for the finer graph (Raphael, 2001)

[Figure: clusters U and V of fine-graph vertices, collapsed into single vertices in the coarse graph]

SLIDE 87

Viterbi or A*?

  • A* intuition: d(t) ⊗ ĥ(t) ranks higher among the d(v) ⊗ ĥ(v)
  • can finish early if lucky
  • actually, d(t) ⊗ ĥ(t) = d(t) ⊗ h(t) = d(t) ⊗ 1 = d(t)
  • with the price of maintaining a priority queue -- O(log V) per operation
  • Q: how early? worth the price?
  • if the rank of the target is r, then A* is better when (r/V) log V < 1, i.e., r < V / log V

[Figure: Dijkstra pops d(v) values until d(t) at rank V; A* pops d(v) ⊗ ĥ(v) values and reaches d(t) at rank r]

SLIDE 93

Background: CFG and Parsing

[Figure: parse chart with goal item (S, 0, n) over the input w0 w1 ... wn-1]

SLIDE 94

(Directed) Hypergraphs

  • a generalization of graphs
  • edge => hyperedge: from several tail vertices to one head vertex
  • e = (T(e), h(e), f_e); arity |e| = |T(e)|
  • a totally-ordered weight set R
  • we borrow the ⊕ operator as the comparison
  • weight function f_e : R^|e| → R
  • generalizes the ⊗ operator in semirings

generic update: d(v) ⊕= f_e(d(u1), d(u2))
simple case: f_e(a, b) = a ⊗ b ⊗ w(e)

[Figure: hyperedge e with tails u1, u2 and head v; e.g. items Y_{i,j} and Z_{j,k} combine into X_{i,k}]
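The hyperedge update can be sketched in the Viterbi semiring with the simple case f_e(a, b) = a ⊗ b ⊗ w(e); `best_derivation` and the tiny example are illustrative:

```python
# Illustrative hyperedge update d(v) ⊕= f_e(d(u1), ..., d(uk)) in the
# Viterbi semiring, with f_e(a1, ..., ak) = a1 × ... × ak × w(e).
def best_derivation(topo_order, in_hyperedges):
    """in_hyperedges[v]: list of (tails, w(e)) pairs; leaves get a
    hyperedge with empty tails. Vertices are listed tails-before-heads."""
    d = {}
    for v in topo_order:
        d[v] = 0.0                              # 0 of the Viterbi semiring
        for tails, w in in_hyperedges.get(v, ()):
            score = w
            for u in tails:
                score *= d[u]                   # multiply in each tail's weight
            d[v] = max(d[v], score)             # ⊕ = max
    return d

# tiny example: A is built from (B, C) with rule prob 0.9,
# or from B alone with rule prob 0.2
hg = {"B": [((), 0.5)], "C": [((), 0.4)],
      "A": [(("B", "C"), 0.9), (("B",), 0.2)]}
d = best_derivation(["B", "C", "A"], hg)
```

Replacing the inner product with an arbitrary `f_e(tail_weights)` gives the general weight-function case described on the slide.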

SLIDE 96

Hypergraphs and Deduction (Nederhof, 2003)

  • hyperedge view: tails u1 : a and u2 : b derive head v : f_e(a, b)
  • deduction view: antecedents (B, i, k) and (C, k, j) derive consequent (A, i, j) by rule A → B C
  • weights: a and b combine into a × b × Pr(A → B C)

SLIDE 97

Related Formalisms

[Figure: a hyperedge e from u1, u2 to v, and its AND/OR-graph view: an AND-node over OR-nodes]

SLIDE 99

Packed Forests (Klein and Manning, 2001; Huang and Chiang, 2005)

  • a compact representation of many parses
  • by sharing common sub-derivations
  • polynomial-space encoding of an exponentially large set
  • a packed forest is a hypergraph: nodes and hyperedges

[Figure: packed forest over "0 I 1 saw 2 him 3 with 4 a 5 mirror 6"]


slide-104
SLIDE 104


Weight Functions and Semirings

[figure: weight functions vs. semirings. Graph case: an edge e from u to v carries d(u) ⊗ w(e), i.e., fe(a) = a ⊗ w(e). Hypergraph case: a hyperedge e with tails u1, ..., uk and head v carries fe(a1, ..., ak); the semiring-composed special case is fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e). Monotonicity and superiority can also be extended to general weight functions.]

slide-105
SLIDE 105


Generalizing Semiring Properties

  • monotonicity
  • semiring: a ≤ b => a ⊗ c ≤ b ⊗ c
  • general weight functions: for all f, all a1 ... ak, and all i,
    if a′i ≤ ai then f(a1, ..., a′i, ..., ak) ≤ f(a1, ..., ai, ..., ak)
  • superiority
  • semiring: a ≤ a ⊗ b and b ≤ a ⊗ b
  • general weight functions: for all f, all a1 ... ak, and all i, ai ≤ f(a1, ..., ak)
  • acyclicity
  • degenerate a hypergraph back into a graph
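These two properties can be spot-checked numerically for a candidate weight function; below is a sketch in the min-plus (tropical) setting, where "better" means numerically smaller and ⊗ is +. The helper names are illustrative, not from the talk:

```python
import itertools

def is_monotonic(f, samples, k):
    """Check: improving (lowering) any single argument never worsens f."""
    for args in itertools.product(samples, repeat=k):
        for i, a in enumerate(args):
            better = args[:i] + (a - 1.0,) + args[i + 1:]
            if f(*better) > f(*args):
                return False
    return True

def is_superior(f, samples, k):
    """Check: f(a1, ..., ak) is never better than any of its arguments."""
    for args in itertools.product(samples, repeat=k):
        if any(f(*args) < a for a in args):
            return False
    return True

cost = lambda a, b: a + b + 1.0      # semiring-composed: a (x) b (x) w(e), w(e) = 1
samples = [0.0, 1.0, 2.5]
mono = is_monotonic(cost, samples, 2)   # holds for any w(e)
sup = is_superior(cost, samples, 2)     # holds because w(e) >= 0
```

Superiority fails once negative edge costs enter, which is exactly why best-first methods need it while topological ones do not.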

slide-106
SLIDE 106


Two Dimensional Survey

[table: two dimensions: search space × traversing order]

search space \ traversing order                    topological (acyclic)   best-first (superior)
graphs with semirings (e.g., FSMs)                 Viterbi                 Dijkstra
hypergraphs with weight functions (e.g., CFGs)     Generalized Viterbi     Knuth

slide-107
SLIDE 107


Viterbi Algorithm for DAGs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each incoming edge (u, v) in E
  • use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
  • key observation: d(u) is already fixed to optimal at this time
  • time complexity: O(V + E)

[figure: edge (u, v) with weight w(u, v); update d(v) ⊕= d(u) ⊗ w(u, v)]
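The two steps above can be sketched as follows (a minimal min-plus instance, so ⊕ is min and ⊗ is +; the toy graph is invented for illustration):

```python
from graphlib import TopologicalSorter

INF = float("inf")

def viterbi_dag(vertices, edges, source):
    """edges: list of (u, v, w).  Min-plus Viterbi: d(v) (+)= d(u) (x) w(u, v)."""
    preds = {v: [] for v in vertices}
    for u, v, w in edges:
        preds[v].append((u, w))
    d = {v: INF for v in vertices}
    d[source] = 0.0
    # 1. topological sort (TopologicalSorter maps each node to its predecessors)
    order = TopologicalSorter({v: [u for u, _ in preds[v]] for v in vertices})
    # 2. visit vertices in sorted order; d(u) is already optimal when used
    for v in order.static_order():
        for u, w in preds[v]:
            d[v] = min(d[v], d[u] + w)      # d(v) ⊕= d(u) ⊗ w(u, v)
    return d

d = viterbi_dag("sabt", [("s", "a", 1.0), ("s", "b", 4.0), ("a", "b", 2.0),
                         ("a", "t", 6.0), ("b", "t", 1.0)], "s")
```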

slide-108
SLIDE 108


Viterbi Algorithm for DAHs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each incoming hyperedge e = ((u1, ..., u|e|), v, fe)
  • use the d(ui)’s to update d(v)
  • key observation: the d(ui)’s are fixed to optimal at this time
  • time complexity: O(V + E) (assuming constant arity)

[figure: hyperedge e with tails u1, u2 and head v; update d(v) ⊕= fe(d(u1), ..., d(u|e|))]
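A hedged sketch of the hypergraph version (min-plus weights again; for brevity the topological order is passed in rather than computed, and axioms are modeled as nullary hyperedges with |e| = 0):

```python
INF = float("inf")

def viterbi_dah(order, hyperedges):
    """order: vertices in topological order (tails before heads).
    hyperedges: list of (tails, head, fe)."""
    incoming = {}
    for tails, head, fe in hyperedges:
        incoming.setdefault(head, []).append((tails, fe))
    d = {v: INF for v in order}
    for v in order:
        # the d(ui)'s of every incoming hyperedge are already optimal here
        for tails, fe in incoming.get(v, []):
            d[v] = min(d[v], fe(*(d[u] for u in tails)))  # d(v) ⊕= fe(d(u1), ..., d(u|e|))
    return d

edges = [((), "B", lambda: 1.0),               # axiom: nullary hyperedge
         ((), "C", lambda: 2.0),
         (("B", "C"), "A", lambda a, b: a + b + 0.5)]
d = viterbi_dah(["B", "C", "A"], edges)
```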


slide-113
SLIDE 113


Example: CKY Parsing

  • parsing with CFGs in Chomsky Normal Form (CNF)
  • typical instance of the generalized Viterbi for DAHs
  • many variants of CKY ~ various topological ordering

[figure: three topological orderings of the CKY hypergraph toward the goal item (S, 0, n): bottom-up, left-to-right, right-to-left; complexity O(n^3 |P|)]
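As an illustration, a minimal bottom-up CKY in the min-plus setting (the toy grammar and sentence are invented; `best[i, j, A]` plays the role of d((A, i, j))):

```python
INF = float("inf")

def cky(words, lexicon, binary):
    """lexicon: {(A, word): cost}; binary: {(A, B, C): cost} for CNF rules A -> B C.
    Returns best[i, j, A] = cost of the best parse of w_i ... w_{j-1} rooted at A."""
    n = len(words)
    best = {}
    for i, w in enumerate(words):                       # width-1 spans (lexical items)
        for (A, word), c in lexicon.items():
            if word == w:
                best[i, i + 1, A] = min(best.get((i, i + 1, A), INF), c)
    for width in range(2, n + 1):                       # bottom-up over span width
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                   # split point
                for (A, B, C), c in binary.items():
                    cand = (best.get((i, k, B), INF)
                            + best.get((k, j, C), INF)
                            + c)                        # d(v) ⊕= d(u1) ⊗ d(u2) ⊗ w(e)
                    if cand < best.get((i, j, A), INF):
                        best[i, j, A] = cand
    return best

words = ["I", "saw", "him"]
lexicon = {("NP", "I"): 0.0, ("V", "saw"): 0.0, ("NP", "him"): 0.0}
binary = {("VP", "V", "NP"): 1.0, ("S", "NP", "VP"): 1.0}
best = cky(words, lexicon, binary)
```

The three nested loops over (width, i, k) plus the rule loop give the O(n^3 |P|) bound from the slide.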

slide-114
SLIDE 114


Example: Syntax-based MT

  • synchronous context-free grammars (SCFGs)
  • context-free grammar in two dimensions
  • generating pairs of strings/trees simultaneously
  • co-indexed nonterminals are further rewritten as a unit

[figure: a synchronized Chinese-English tree pair for "yu Shalong juxing le huitan" / "held a meeting with Sharon"]

VP → PP(1) VP(2), VP(2) PP(1)
VP → juxing le huitan, held a meeting
PP → yu Shalong, with Sharon


slide-118
SLIDE 118


Translation as Parsing

[figure: decoding "yu Shalong juxing le huitan": PP1,3 yields "with Sharon", VP3,6 yields "held a talk", and VP1,6 combines them into "held a talk with Sharon"]

VP → PP(1) VP(2), VP(2) PP(1)
VP → juxing le huitan, held a meeting
PP → yu Shalong, with Sharon

complexity: same as CKY parsing -- O(n^3)


slide-121
SLIDE 121


Adding a Bigram Model

[figure: items refined with target-side boundary words. PP1,3 candidates: "with ... Sharon", "along ... Sharon", "with ... Shalong"; VP3,6 candidates: "held ... talk", "held ... meeting", "hold ... talks". Combining VP3,6 ("held ... talk") with PP1,3 ("with ... Sharon") scores the bigram at the boundary and yields VP1,6 ("held ... Sharon")]

complexity: O(n^3 V^{4(m-1)}) for an m-gram language model

slide-122
SLIDE 122


Two Dimensional Survey

[table: two dimensions: search space × traversing order]

search space \ traversing order                    topological (acyclic)   best-first (superior)
graphs with semirings (e.g., FSMs)                 Viterbi                 Dijkstra
hypergraphs with weight functions (e.g., CFGs)     Generalized Viterbi     Knuth



slide-126
SLIDE 126


Forward Variant for DAHs

  • 1. topological sort
  • 2. visit each vertex v in sorted order and do updates
  • for each outgoing hyperedge e = ((u1, ..., u|e|), h(e), fe)
  • if the d(ui)’s have all been fixed to optimal
  • use the d(ui)’s to update d(h(e))
  • time complexity: O(V + E)

[figure: vertex v = ui is one tail of an outgoing hyperedge e with head h(e)]

Q: how to avoid repeatedly checking whether all tails are fixed? Maintain a counter r[e] for each hyperedge e: how many tails are yet to be fixed; fire the hyperedge only when r[e] = 0.
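The counter trick can be sketched like this (min-plus again; with the counters, each hyperedge fires exactly once, when its last tail is fixed):

```python
INF = float("inf")

def forward_viterbi_dah(order, hyperedges, axioms):
    """order: vertices in topological order; axioms: {v: initial cost}.
    r[e] counts tails not yet fixed; e fires only when r[e] reaches 0."""
    d = {v: axioms.get(v, INF) for v in order}
    r = [len(tails) for tails, _, _ in hyperedges]   # remaining unfixed tails
    out = {}                                         # vertex -> hyperedges it feeds
    for idx, (tails, _, _) in enumerate(hyperedges):
        for u in tails:
            out.setdefault(u, []).append(idx)
    for v in order:                                  # v is fixed to optimal here
        for idx in out.get(v, []):
            r[idx] -= 1
            if r[idx] == 0:                          # all tails fixed: fire e
                tails, head, fe = hyperedges[idx]
                d[head] = min(d[head], fe(*(d[u] for u in tails)))
    return d

edges = [(("B", "C"), "A", lambda a, b: a + b + 0.5)]
d = forward_viterbi_dah(["B", "C", "A"], edges, {"B": 1.0, "C": 2.0})
```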


slide-129
SLIDE 129


Dijkstra Algorithm

  • keep a cut (S : V - S) where S vertices are fixed
  • maintain a priority queue Q of V - S vertices
  • each iteration choose the best vertex v from Q
  • move v to S, and use d(v) to forward-update others

[figure: cut (S : V - S) with source s in S; the chosen vertex v is moved to S, then each edge (v, u) forward-updates d(u) ⊕= d(v) ⊗ w(v, u)]

time complexity: O((V+E) log V) (binary heap); O(V log V + E) (Fibonacci heap)
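A compact sketch of Dijkstra in this notation (min-plus, nonnegative weights for superiority; the lazy-deletion heap shown here re-inserts instead of decrease-key, matching the O((V+E) log V) binary-heap bound):

```python
import heapq

def dijkstra(vertices, edges, source):
    """edges: list of (u, v, w) with w >= 0 (superiority)."""
    adj = {v: [] for v in vertices}
    for u, v, w in edges:
        adj[u].append((v, w))
    d = {v: float("inf") for v in vertices}
    d[source] = 0.0
    queue = [(0.0, source)]                 # priority queue over V - S
    fixed = set()                           # the set S of fixed vertices
    while queue:
        dv, v = heapq.heappop(queue)        # best vertex in V - S
        if v in fixed:
            continue                        # stale (lazily deleted) entry
        fixed.add(v)                        # move v to S ...
        for u, w in adj[v]:                 # ... and forward-update its neighbors
            if dv + w < d[u]:               # d(u) ⊕= d(v) ⊗ w(v, u)
                d[u] = dv + w
                heapq.heappush(queue, (d[u], u))
    return d

d = dijkstra("sabt", [("s", "a", 1.0), ("s", "b", 4.0), ("a", "b", 2.0),
                      ("a", "t", 6.0), ("b", "t", 1.0)], "s")
```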


slide-132
SLIDE 132


Knuth (1977) Algorithm

  • keep a cut (S : V - S) where S vertices are fixed
  • maintain a priority queue Q of V - S vertices
  • each iteration choose the best vertex v from Q
  • move v to S, and use d(v) to forward-update others

[figure: cut (S : V - S) with source s in S; when v is chosen and moved to S, each outgoing hyperedge e whose tails are all in S updates its head h(e) via fe]

time complexity: O((V+E) log V) (binary heap); O(V log V + E) (Fibonacci heap)
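Knuth's generalization can be sketched by replacing edge relaxation with hyperedge firing (a deliberately naive scan over all hyperedges for clarity; a real implementation would index hyperedges by tail and keep counters, as in the forward variant). Superior, monotonic weight functions are assumed:

```python
import heapq

def knuth(vertices, hyperedges, axioms):
    """hyperedges: list of (tails, head, fe); axioms: {v: cost}.
    Assumes fe is monotonic and superior, e.g. fe(a, b) = a + b + w, w >= 0."""
    d = dict.fromkeys(vertices, float("inf"))
    d.update(axioms)
    queue = [(c, v) for v, c in axioms.items()]
    heapq.heapify(queue)
    fixed = set()                              # the set S
    while queue:
        dv, v = heapq.heappop(queue)           # best vertex in V - S
        if v in fixed:
            continue
        fixed.add(v)                           # move v into S; d(v) is now final
        for tails, head, fe in hyperedges:
            if head not in fixed and v in tails and all(u in fixed for u in tails):
                cand = fe(*(d[u] for u in tails))
                if cand < d[head]:
                    d[head] = cand
                    heapq.heappush(queue, (cand, head))
    return d

d = knuth("BCA", [(("B", "C"), "A", lambda a, b: a + b + 0.5)],
          {"B": 1.0, "C": 2.0})
```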


slide-134
SLIDE 134


Example: Best-First/A* Parsing

  • Knuth for parsing: best-first parsing (Caraballo & Charniak, 1998)
  • further speed-up: use A* heuristics
  • significant speed-ups shown with carefully designed heuristic functions (Klein and Manning, 2003)
  • heuristic function: an estimate of the outside cost

[figure: the goal item (S, 0, n)]



slide-137
SLIDE 137


Outside Cost in Hypergraph

  • outside cost: what is yet to be paid to reach the goal
  • let’s consider only the semiring-composed case
  • and only acyclic hypergraphs
  • after computing d(v) for all v bottom-up
  • run a backward Viterbi pass top-down (outside-in)

h(S0,n) = 1̄ (the identity of ⊗)
h(v) ⊕= h(u) ⊗ w(e) ⊗ d(v′)

[figure: hyperedge e with head u and tails v, v′; graph analogy: on a path s → v → t, d(v) is the inside cost from s to v and h(v) the outside cost from v to t]

Q: d(v) ⊗ h(v) = ?
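In the semiring-composed acyclic case the backward pass looks like this (a min-plus sketch, so 1̄ = 0; d(v) ⊗ h(v), the merit of v, is then the cost of the best derivation passing through v):

```python
INF = float("inf")

def outside(order, hyperedges, d, goal):
    """order: topological (bottom-up); hyperedges: (tails, head, w), min-plus.
    h(goal) = 0; h(v) ⊕= h(head) ⊗ w(e) ⊗ (the d's of the sibling tails)."""
    h = {v: INF for v in order}
    h[goal] = 0.0                                # h(S0,n) = 1-bar
    for head in reversed(order):                 # top-down (outside-in)
        for tails, hd, w in hyperedges:
            if hd != head:
                continue
            for i, v in enumerate(tails):
                siblings = sum(d[u] for j, u in enumerate(tails) if j != i)
                h[v] = min(h[v], h[head] + w + siblings)
    return h

edges = [(("B", "C"), "A", 0.5)]
d = {"B": 1.0, "C": 2.0, "A": 3.5}               # inside costs from the Viterbi pass
h = outside(["B", "C", "A"], edges, d, "A")
# merit d(v) + h(v): cost of the best derivation through v
```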


slide-143
SLIDE 143


Projection-based Heuristics

  • how to guess? project onto a coarser-grained space
  • and parse with the coarser grammar
  • use the outside cost of the coarser item as the heuristic

(Klein and Manning, 2003)

ĥ(VBD2,3) = h′(V2,3)

slide-144
SLIDE 144


Analogy with Graphs



slide-146
SLIDE 146


More on Coarse-to-Fine

  • multilevel coarse-to-fine A*
  • heuristic = exact outside cost from the previous (coarser) stage
  • ĥi(v) = hi-1(proj i-1(v))
  • e.g., with projections VBD > V > X: ĥi(VBD1,5) = hi-1(V1,5); ĥi-1(V1,5) = hi-2(X1,5)
  • multilevel coarse-to-fine Viterbi with beam search
  • Viterbi + beam pruning in each stage
  • prune according to merit: d(v) ⊗ h(v) ⊘ d(TOP)
  • hard to derive a provably correct threshold
  • in practice: a preset threshold works well



slide-154
SLIDE 154


Same Picture Again

[figure: Venn diagram of monotonic optimization problems: acyclic (Viterbi) vs. superior (Knuth), with many NLP problems, e.g. PCFG parsing with CNF, in the intersection. Other labels: Inside-Outside Alg. (Inside semiring), non-prob. (discriminative) parsing, cyclic grammars, generalized Bellman-Ford (open)]

slide-155
SLIDE 155


Take Home Message

  • Dynamic Programming is cool, easy, and universal!
  • two frameworks and two types of algorithms
  • monotonicity; acyclicity and/or superiority
  • topological (Viterbi) vs. best-first style (Dijkstra/Knuth/A*)
  • when to choose which: A* can finish early if lucky
  • graph (lattice) vs. hypergraph (forest)
  • incremental, finite-state vs. branching, context-free
  • covered many typical NLP applications
  • a better understanding of theory helps in practice


slide-156
SLIDE 156

THE END - Thanks!


final slides will be available on my website.


Questions? Comments?