Forest-based Algorithms in Natural Language Processing Liang Huang - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Forest-based Algorithms in

Natural Language Processing

includes joint work with David Chiang, Kevin Knight, Aravind Joshi, Haitao Mi and Qun Liu

CMU LTI Seminar, Pittsburgh, PA, May 14, 2009

INSTITUTE OF COMPUTING TECHNOLOGY

Liang Huang

  • overview of Ph.D. work done at Penn (and ISI, ICT)
slide-2
SLIDE 2

Forest Algorithms

NLP is all about ambiguities

  • to middle school kids: what does this sentence mean?

2

Aravind Joshi

I saw her duck.

slide-4
SLIDE 4

Forest Algorithms

NLP is all about ambiguities

3

Aravind Joshi

I eat sushi with tuna.

  • to middle school kids: what does this sentence mean?
slide-5
SLIDE 5

Forest Algorithms

NLP is all about ambiguities

4

I saw her duck.

slide-7
SLIDE 7

Forest Algorithms

NLP is all about ambiguities

  • how about...
  • I saw her duck with a telescope.
  • I saw her duck with a telescope in the garden...

4

... I saw her duck.

slide-8
SLIDE 8

Forest Algorithms

NLP is HARD!

  • exponential explosion of the search space
  • non-local dependencies (context)

5

...

(S (NP (PRP I)) (VP (VBD saw) (NP (PRP$ her) (NN duck)) (PP (IN with) (NP (DT a) (NN telescope)))))

slide-9
SLIDE 9

Forest Algorithms

Ambiguities in Translation

6

zìzhù zhōngduān 自助终端 ("self-help terminal-device", i.e., a self-service kiosk)

needs context to disambiguate!

slide-10
SLIDE 10

Forest Algorithms

Evil Rubbish; Safety Export

7

needs context for fluency!

slide-11
SLIDE 11

Forest Algorithms

Key Problem

8

slide-12
SLIDE 12

Forest Algorithms

Key Problem

  • How to efficiently incorporate non-local information?

8

slide-13
SLIDE 13

Forest Algorithms

Key Problem

  • How to efficiently incorporate non-local information?
  • Solution 1: pipelined reranking / rescoring
  • postpone disambiguation by propagating k-best lists
  • examples: tagging => parsing => semantics
  • (open) need efficient algorithms for k-best search

8

slide-14
SLIDE 14

Forest Algorithms

Key Problem

  • How to efficiently incorporate non-local information?
  • Solution 1: pipelined reranking / rescoring
  • postpone disambiguation by propagating k-best lists
  • examples: tagging => parsing => semantics
  • (open) need efficient algorithms for k-best search
  • Solution 2: exact joint search on a much larger space
  • examples: head/parent annotations; often intractable

8

slide-15
SLIDE 15

Forest Algorithms

Key Problem

  • How to efficiently incorporate non-local information?
  • Solution 1: pipelined reranking / rescoring
  • postpone disambiguation by propagating k-best lists
  • examples: tagging => parsing => semantics
  • (open) need efficient algorithms for k-best search
  • Solution 2: exact joint search on a much larger space
  • examples: head/parent annotations; often intractable
  • Solution 3: approximate joint search (focus of this talk)
  • (open) integrate non-local information on the fly

8

slide-16
SLIDE 16

Forest Algorithms

Outline

  • Forest: Packing Exponential Ambiguities
  • Exact k-best Search in Forest (Solution 1)
  • Approximate Joint Search with

Non-Local Features (Solution 3)

  • Forest Reranking
  • Forest Rescoring
  • Forest-based Translation (Solutions 2+3+1)
  • Tree-based Translation
  • Forest-based Decoding

9

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

slide-17
SLIDE 17

Forest Algorithms

Packed Forests

  • a compact representation of many parses
  • by sharing common sub-derivations
  • polynomial-space encoding of exponentially large set

10

(Klein and Manning, 2001; Huang and Chiang, 2005)

0 I 1 saw 2 him 3 with 4 a 5 mirror 6

nodes hyperedges

a hypergraph
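As a concrete sketch, a packed forest can be stored as a hypergraph in which every node keeps a list of incoming hyperedges, one per alternative sub-derivation. The class and field names below are illustrative, not from the talk:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # identity-based hashing: two nodes are distinct objects
class Node:
    label: str                # e.g. "VP[1,6]": category plus span
    incoming: list = field(default_factory=list)  # hyperedges deriving this node

@dataclass
class Hyperedge:
    tails: tuple              # antecedent nodes (u1, ..., u|e|)
    head: Node                # consequent node v
    weight: float             # e.g. log-probability of the grammar rule

def add_edge(head, tails, weight):
    """Attach one more derivation of `head`; sharing tail nodes across
    hyperedges is what makes the forest polynomial-size."""
    e = Hyperedge(tuple(tails), head, weight)
    head.incoming.append(e)
    return e
```

Ambiguity is packed simply by calling `add_edge` more than once on the same head node.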

slide-18
SLIDE 18

Forest Algorithms

Weight Functions

  • Each hyperedge e has a weight function f_e
  • monotonic in each argument
  • e.g. in CKY, f_e(a, b) = a × b × Pr(rule)
  • optimal subproblem property in dynamic programming
  • optimal solutions include optimal sub-solutions

11

monotonicity (from the figure): given antecedents B: b and C: c, the consequent is A: f(b, c); if b′ ≤ b, then f(b′, c) ≤ f(b, c)

slide-19
SLIDE 19

Forest Algorithms

1-best Viterbi on Forest

  • 1. topological sort (assumes acyclicity)
  • 2. visit each node v in sorted order and do updates
  • for each incoming hyperedge e = ((u_1, ..., u_|e|), v, f_e)
  • use the d(u_i)'s to update d(v)
  • key observation: the d(u_i)'s are fixed to optimal at this time
  • time complexity: O(V+E) = O(E); for CKY: O(n³)

12

update rule: d(v) ⊕= f_e(d(u_1), …, d(u_|e|))
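A minimal Python sketch of this 1-best Viterbi pass, using the (max, +) semiring over log-probabilities; the `incoming` map encoding hyperedges as (tails, weight) pairs is a hypothetical representation, not the talk's:

```python
import math

def viterbi(topo_order, incoming):
    """1-best Viterbi over an acyclic hypergraph.

    topo_order: nodes sorted so that every tail precedes its head.
    incoming:   node -> list of (tails, weight) hyperedges; a node with
                no entry is a leaf/axiom with score 0.
    Scores are log-probs, so the semiring is (max, +) and the weight
    function f_e(a1, ..., an) = weight + a1 + ... + an is monotonic
    in each argument.
    """
    d = {}
    for v in topo_order:
        edges = incoming.get(v, [])
        if not edges:                    # leaf / axiom node
            d[v] = 0.0
            continue
        # every d(u) below is already final: that is exactly the
        # optimal-substructure guarantee topological order buys us
        d[v] = max(w + sum(d[u] for u in tails) for tails, w in edges)
    return d
```

Each node and hyperedge is touched once, matching the O(V+E) bound above.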

slide-20
SLIDE 20

Forest Algorithms

Outline

  • Forest: Packing Exponential Ambiguities
  • Exact k-best Search in Forest (Solution 1)
  • Approximate Joint Search with

Non-Local Features (Solution 3)

  • Forest Reranking
  • Forest Rescoring
  • Forest-based Translation (Solutions 2+3)
  • Tree-based Translation
  • Forest-based Decoding

13

TOP S NP PRP I VP VBD saw NP DT the NN boy PP IN with NP DT a NN telescope . .

slide-21
SLIDE 21

Forest Algorithms

k-best Viterbi Algorithm 0

  • straightforward k-best extension
  • a vector of k (sorted) values for each node
  • now what's the result of f_e(a, b)?
  • k × k = k² possibilities! => then choose top k
  • time complexity: O(k² E)

14


slide-22
SLIDE 22

Forest Algorithms

k-best Viterbi Algorithm 1

  • key insight: do not need to enumerate all k² combinations
  • since vectors a and b are sorted
  • and the weight function f_e is monotonic
  • (a_1, b_1) must be the best
  • either (a_2, b_1) or (a_1, b_2) is the 2nd-best
  • use a priority queue for the frontier
  • extract best
  • push two successors
  • time complexity: O(E k log k)

15


slide-26
SLIDE 26

Forest Algorithms

k-best Viterbi Algorithm 2

  • Algorithm 1 works on each hyperedge sequentially
  • O(E k log k) is still too slow for big k
  • Algorithm 2 processes all hyperedges in parallel
  • dramatic speed-up: O(E + V k log k)

19

(figure: node VP1,6 with several incoming hyperedges, each pairing a PP with a smaller VP, e.g. (PP1,3, VP3,6) and (PP1,4, VP4,6))

slide-27
SLIDE 27

Forest Algorithms

k-best Viterbi Algorithm 3

  • Algorithm 2 computes k-best for each node
  • but we are only interested in k-best of the root node
  • Algorithm 3 computes as many as really needed
  • forward-phase
  • same as 1-best Viterbi, but stores the forest

(keeping alternative hyperedges)

  • backward-phase
  • recursively asking “what’s your 2nd-best” top-down
  • asks for more when need more

20

slide-28
SLIDE 28

Forest Algorithms

Summary of Algorithms

  • Algorithms 1 => 2 => 3
  • lazier and lazier (computation on demand)
  • larger and larger locality
  • Algorithm 3 is very fast, but requires storing forest

21

              locality     time                space
Algorithm 1   hyperedge    O(E k log k)        O(k V)
Algorithm 2   node         O(E + V k log k)    O(k V)
Algorithm 3   global       O(E + D k log k)    O(E + k D)

E = # hyperedges: O(n³); V = # nodes: O(n²); D = derivation length: O(n)

slide-29
SLIDE 29

Forest Algorithms

Experiments - Efficiency

  • on state-of-the-art Collins/Bikel parser (Bikel, 2004)
  • average parsing time per sentence using Algs. 0, 1, 3

22

O( E + D k log k )

slide-30
SLIDE 30

Forest Algorithms

Reranking and Oracles

  • oracle - the candidate closest to the correct parse

among the k-best candidates

  • measures the potential of real reranking

23

(plot: oracle Parseval score as a function of k; curves for Collins 2000, which turns off DP, vs. our algorithms)

slide-31
SLIDE 31

Forest Algorithms

Outline

  • Packed Forests and Hypergraph Framework
  • Exact k-best Search in Forest (Solution 1)
  • Approximate Joint Search with

Non-Local Features (Solution 3)

  • Forest Reranking
  • Forest Rescoring
  • Application: Forest-based Translation
  • Tree-based Translation
  • Forest-based Decoding

24

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

slide-32
SLIDE 32

Forest Algorithms

Why not k-best reranking?

  • too few variations (limited scope)
  • 41% of correct parses are not in ~30-best (Collins, 2000)
  • worse for longer sentences
  • too many redundancies
  • 50-best usually encodes only 5-6 binary decisions (2⁵ < 50 < 2⁶)

25

...

slide-34
SLIDE 34

Forest Algorithms

Redundancies in n-best lists

26

(TOP (S (NP (NP (RB Not) (PDT all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VBP oppose) (NP (DT the) (NNS changes))) (. .))) (TOP (S (RB Not) (NP (NP (PDT all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VBP oppose) (NP (DT the) (NNS changes))) (. .))) (TOP (S (NP (NP (RB Not) (DT all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VBP oppose) (NP (DT the) (NNS changes))) (. .))) (TOP (S (RB Not) (NP (NP (DT all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VBP oppose) (NP (DT the) (NNS changes))) (. .))) (TOP (S (NP (NP (RB Not) (PDT all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VB oppose) (NP (DT the) (NNS changes))) (. .))) (TOP (S (NP (NP (RB Not) (RB all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VBP oppose) (NP (DT the) (NNS changes))) (. .))) (TOP (S (RB Not) (NP (NP (PDT all) (DT those)) (SBAR (WHNP (WP who)) (S (VP (VBD wrote))))) (VP (VB oppose) (NP (DT the) (NNS changes))) (. .)))

Not all those who wrote oppose the changes.

packed forest ...

slide-35
SLIDE 35

Forest Algorithms

Reranking on a Forest?

  • with only local features (Solution 2)
  • dynamic programming, exact, tractable (Taskar et al. 2004;

McDonald et al., 2005)

  • with non-local features (Solution 3)
  • on-the-fly reranking at internal nodes
  • top k derivations at each node
  • use as many non-local features

as possible at each node

  • chart parsing + discriminative reranking
  • we use perceptron for simplicity

27

slide-36
SLIDE 36

Forest Algorithms

Features

  • a feature f is a function from a tree y to a real number
  • f₁(y) = log Pr(y) is the log probability from the generative parser
  • every other feature counts the number of times a particular configuration occurs in y

28

instances of the Rule feature: f₁₀₀(y) = f_{S → NP VP .}(y) = 1; f₂₀₀(y) = f_{NP → DT NN}(y) = 2

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

  • our features are from (Charniak & Johnson, 2005) and (Collins, 2000)

slide-37
SLIDE 37

Forest Algorithms

Local vs. Non-Local Features

  • a feature is local iff it can be factored among local productions of a tree (i.e., hyperedges in a forest)
  • local features can be pre-computed on each hyperedge in the forest; non-local features cannot

29

Rule is local; ParentRule is non-local

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

slide-38
SLIDE 38

Forest Algorithms

Local vs. Non-Local: Examples

  • the CoLenPar feature captures the difference in lengths of adjacent conjuncts (Charniak and Johnson, 2005)

30

CoLenPar: 2

local!

slide-39
SLIDE 39

Forest Algorithms

Local vs. Non-Local: Examples

  • CoPar feature captures the depth to which adjacent

conjuncts are isomorphic (Charniak and Johnson, 2005)

31

CoPar: 4

non-local!

(violates DP principle)

slide-40
SLIDE 40

Forest Algorithms

Factorizing non-local features

  • going bottom-up, at each node
  • compute (partial values of) feature instances that

become computable at this level

  • postpone those uncomputable to ancestors

32

unit instance of ParentRule feature at VP node

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

non-local features factor across nodes dynamically; local features factor across hyperedges statically
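As a toy illustration of the local/non-local distinction (the feature-name strings and tuple tree encoding here are hypothetical, not the talk's): a Rule feature depends on one production, i.e. one hyperedge, while a ParentRule feature only becomes computable one level up, when the parent production is built.

```python
from collections import Counter

def rule_features(tree):
    """Count Rule and ParentRule features over a tree given as nested
    tuples, e.g. ("S", ("NP", ("PRP", "I")), ("VP", ...)).

    Each production is one hyperedge, so Rule counts can be precomputed
    per hyperedge (local).  ParentRule also needs the parent label, so
    on a forest it is non-local: its unit instance is postponed until
    the parent production is visited.
    """
    feats = Counter()

    def visit(node, parent_label=None):
        label, *children = node
        if children and isinstance(children[0], tuple):   # internal node
            rhs = " ".join(c[0] for c in children)
            feats[f"Rule: {label} -> {rhs}"] += 1
            if parent_label is not None:                  # computable one level up
                feats[f"ParentRule: {parent_label} | {label} -> {rhs}"] += 1
            for c in children:
                visit(c, label)

    visit(tree)
    return feats
```

Preterminals (e.g. `("PRP", "I")`) contribute no Rule instance in this sketch; only internal productions do.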

slide-43
SLIDE 43

Forest Algorithms

Factorizing non-local features

  • going bottom-up, at each node
  • compute (partial values of) feature instances that

become computable at this level

  • postpone those uncomputable to ancestors

33

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

unit instance of ParentRule feature at S node

non-local features factor across nodes dynamically local features factor across hyperedges statically

slide-46
SLIDE 46

Forest Algorithms

Factorizing non-local features

  • going bottom-up, at each node
  • compute (partial values of) feature instances that

become computable at this level

  • postpone those uncomputable to ancestors

34

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

unit instance of ParentRule feature at TOP node

non-local features factor across nodes dynamically local features factor across hyperedges statically

slide-48
SLIDE 48

Forest Algorithms

(TOP (S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN boy)) (PP (IN with) (NP (DT a) (NN telescope)))) (. .)))

NGramTree (C&J 05)

  • an NGramTree captures the smallest tree fragment

that contains a bigram (two consecutive words)

  • unit instances are boundary words between subtrees

A_{i,k} → B_{i,j} C_{j,k}, where B spans w_i … w_{j−1} and C spans w_j … w_{k−1}

unit instance of node A

35

slide-53
SLIDE 53

Forest Algorithms

Approximate Decoding

  • bottom-up, keeps top k derivations at each node
  • non-monotonic grid due to non-local features

38

A_{i,k} → B_{i,j} C_{j,k}, where B spans w_i … w_{j−1} and C spans w_j … w_{k−1}

combining sorted vectors a = (1.0, 3.0, 8.0) and b = (1.0, 1.1, 3.5); each cell adds a non-local term w · f_N(·) (e.g. 0.5):

2.0 + 0.5    4.0 + 5.0     9.0 + 0.5
2.1 + 0.3    4.1 + 5.4     9.1 + 0.3
4.5 + 0.6    6.5 + 10.5    11.5 + 0.6

slide-55
SLIDE 55

Forest Algorithms

Approximate Decoding

  • bottom-up, keeps top k derivations at each node
  • non-monotonic grid due to non-local features

39

A_{i,k} → B_{i,j} C_{j,k}, where B spans w_i … w_{j−1} and C spans w_j … w_{k−1}

with the non-local terms w · f_N(·) added in, the grid of combined costs for a = (1.0, 3.0, 8.0) and b = (1.0, 1.1, 3.5) is no longer monotonic:

 2.5    9.0    9.5
 2.4    9.5    9.4
 5.1   17.0   12.1

slide-56
SLIDE 56

Forest Algorithms

Algorithm 2 => Cube Pruning

  • bottom-up, keeps top k derivations at each node
  • non-monotonic grid due to non-local features

40

A_{i,k} → B_{i,j} C_{j,k}, where B spans w_i … w_{j−1} and C spans w_j … w_{k−1}

with the non-local terms w · f_N(·) added in, the grid of combined costs for a = (1.0, 3.0, 8.0) and b = (1.0, 1.1, 3.5) is no longer monotonic:

 2.5    9.0    9.5
 2.4    9.5    9.4
 5.1   17.0   12.1

slide-59
SLIDE 59

Forest Algorithms

Algorithm 2 => Cube Pruning

43

(figure: a VP node with several incoming hyperedges, each pairing a PP with a smaller VP, e.g. (PP1,3, VP3,6) and (PP1,4, VP4,6))

  • process all hyperedges simultaneously!

significant savings of computation

there are search errors, but the trade-off is favorable.
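The non-monotonic grid above can be explored with a cube-pruning sketch (illustrative Python; the non-local cost, e.g. an n-gram LM term, is passed in as a function, and lower cost is better):

```python
import heapq

def cube_pruning(a, b, nonlocal_cost, k):
    """Cube-pruning sketch: k-best combination of two sorted cost
    vectors when each cell also pays a non-local cost that breaks
    monotonicity.

    The frontier is explored as in exact lazy k-best, but since
    nonlocal_cost(i, j) is arbitrary, the pop order is only a guess;
    we therefore collect k candidates and re-sort them at the end.
    The result may contain search errors, but the trade-off is
    usually favorable.
    """
    pops, seen = [], {(0, 0)}
    frontier = [(a[0] + b[0] + nonlocal_cost(0, 0), 0, 0)]
    while frontier and len(pops) < k:
        cost, i, j = heapq.heappop(frontier)
        pops.append(cost)
        for ni, nj in ((i + 1, j), (i, j + 1)):   # push the two successors
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                score = a[ni] + b[nj] + nonlocal_cost(ni, nj)
                heapq.heappush(frontier, (score, ni, nj))
    return sorted(pops)
```

On the grid from the slides (a = (1.0, 3.0, 8.0), b = (1.0, 1.1, 3.5) plus the per-cell non-local terms), the true best cell (2.4) is not popped first, but it is still recovered after re-sorting; in general it can be missed.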

slide-60
SLIDE 60

Forest Algorithms

Forest vs. k-best Oracles

  • on top of Charniak parser (modified to dump forest)
  • forests enjoy higher oracle scores than k-best lists
  • with much smaller sizes

44

96.7 97.8 97.2 98.6

slide-62
SLIDE 62

Forest Algorithms

Main Results

baseline: 1-best Charniak parser, F1 = 89.72

approach             training time   F1%     feature-extraction space / time
50-best reranking    4 × 0.3h        91.43   2.4 GB / 19 h
100-best reranking   4 × 0.7h        91.49   5.3 GB / 44 h
forest reranking     4 × 6.1h        91.69   1.2 GB / 2.9 h

  • forest reranking beats 50-best & 100-best reranking
  • can be trained on the whole treebank in ~1 day even

with a pure Python implementation!

  • most previous work only scaled to short sentences

(<=15 words) and local features

45

slide-64
SLIDE 64

Forest Algorithms

Comparison with Others

46

system                          F1%
Collins (2000)                  89.7
Charniak and Johnson (2005)     91.0
updated (2006)                  91.4
Petrov and Klein (2008)         88.3
this work                       91.7
Carreras et al. (2008)          91.1
Bod (2000)                      90.7
Petrov and Klein (2007)         90.1
McClosky et al. (2006)          92.1

(systems grouped in the slide by type: n-best reranking, dynamic programming, and semi-supervised)

best accuracy to date on the Penn Treebank, and fast training

slide-65
SLIDE 65
  • On to Machine Translation...

applying the same ideas of non-locality...

slide-67
SLIDE 67

Forest Algorithms

Translate Server Error

48

clear evidence that MT is used in real life.

slide-72
SLIDE 72

Forest Algorithms

Context in Translation

49

xiaoxin 小心 X <=> be careful not to X
xiaoxin gou 小心 狗 <=> be aware of dog

fluency problem (n-gram): Algorithm 2 => cube pruning
syntax problem (SCFG): 小心 VP <=> be careful not to VP; 小心 NP <=> be careful of NP

slide-75
SLIDE 75

Forest Algorithms

How do people translate?

  • 1. understand the source language sentence
  • 2. generate the target language translation

50

布什    与         沙龙      举行     了       会谈
Bùshí   yǔ         Shālóng   jǔxíng   le       huìtán
Bush    and/with   Sharon    hold     [past]   meeting

“Bush held a meeting with Sharon”

slide-79
SLIDE 79

Forest Algorithms

How do compilers translate?

  • 1. parse high-level language program into a syntax tree
  • 2. generate intermediate or machine code accordingly

51

x3 = y + 3;

LD   R1, id2
ADDF R1, R1, #3.0   // add float
RTOI R2, R1         // real to int
ST   id1, R2

syntax-directed translation (~1960)

slide-80
SLIDE 80

Forest Algorithms

Syntax-Directed Machine Translation

  • get 1-best parse tree; then convert to English

52

Bush hold

and/ with

meeting Sharon [past.] “Bush held a meeting with Sharon”

slide-81
SLIDE 81

Forest Algorithms

Syntax-Directed Machine Translation

53

  • recursive rewrite by pattern-matching

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-84
SLIDE 84

Forest Algorithms

Syntax-Directed Machine Translation

53

  • recursive rewrite by pattern-matching

with

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-85
SLIDE 85

Forest Algorithms

Syntax-Directed Machine Translation

  • recursively solve unfinished subproblems

54

with

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-87
SLIDE 87

Forest Algorithms

Syntax-Directed Machine Translation

  • recursively solve unfinished subproblems

54

with Bush

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-88
SLIDE 88

Forest Algorithms

Syntax-Directed Machine Translation

  • recursively solve unfinished subproblems

54

held with Bush

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-89
SLIDE 89

Forest Algorithms

Syntax-Directed Machine Translation

  • continue pattern-matching

55

Bush held with

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-90
SLIDE 90

Forest Algorithms

Syntax-Directed Machine Translation

  • continue pattern-matching

55

Bush held with a meeting Sharon

(Galley et al. 2004; Liu et al., 2006; Huang, Knight, Joshi 2006)

slide-92
SLIDE 92

Forest Algorithms

Syntax-Directed Machine Translation

  • continue pattern-matching

56

this method is simple, fast, and expressive. but...
crucial difference between PL and NL: ambiguity!
using the 1-best parse causes error propagation!
idea: use k-best parses? use a parse forest!

Bush held with a meeting Sharon

slide-93
SLIDE 93

Forest Algorithms

Forest-based Translation

57

“and” / “with”

slide-94
SLIDE 94

Forest Algorithms

Forest-based Translation

58

pattern-matching on forest

“and” / “with”

slide-95
SLIDE 95

Forest Algorithms

“and”

Forest-based Translation

58

pattern-matching on forest

“and” / “with”

slide-100
SLIDE 100

Forest Algorithms

“and”

Forest-based Translation

58

pattern-matching on forest

“and” / “with”

directed by underspecified syntax

slide-101
SLIDE 101

Forest Algorithms

Translation Forest

59

slide-104
SLIDE 104

Forest Algorithms

Translation Forest

59

“held a meeting”

“Sharon” “Bush”

“Bush held a meeting with Sharon”

slide-106
SLIDE 106

Forest Algorithms

The Whole Pipeline

60

pipeline: input sentence → (parser) → parse forest → (pattern-matching w/ translation rules, exact) → translation forest → (Algorithm 2 => cube pruning, approx.) → translation+LM forest → (Algorithm 3, exact) → 1-best / k-best translations

all intermediate results are packed forests

(Huang and Chiang, 2005; 2007; Chiang, 2007)

slide-107
SLIDE 107

Forest Algorithms

k-best trees vs. forest-based

61

1.7 Bleu improvement over 1-best, 0.8 over 30-best, and even faster!

slide-108
SLIDE 108

Forest Algorithms

forest as virtual ∞-best list

  • how often is the ith-best tree picked by the decoder?

62

32% beyond 100-best; 20% beyond 1000-best

suggested by Mark Johnson

slide-109
SLIDE 109

Forest Algorithms

Larger Decoding Experiments

  • 2.2M sentence pairs (57M Chinese and 62M English words)
  • larger trigram models (1/3 of Xinhua Gigaword)
  • also use bilingual phrases (BP) as flat translation rules
  • phrases that are consistent with syntactic constituents
  • forest enables larger improvement with BP

63

               T2S      T2S+BP
1-best tree    0.2666   0.2939
30-best trees  0.2755   0.3084
forest         0.2839   0.3149
improvement    +1.7     +2.1   (Bleu points)

slide-110
SLIDE 110

Forest Algorithms

Conclusions: Dynamic Programming

  • A general framework of DP on monotonic hypergraphs
  • Exact k-best DP algorithms (monotonic)
  • Approximate DP with non-local features (non-monotonic)
  • Forest Reranking for discriminative parsing
  • Forest Rescoring for MT decoding
  • Forest-based Translation
  • translates a parse forest of millions of trees
  • even faster than translating top-30 trees (and better)
  • Future Directions: even faster search with richer info...

64

slide-111
SLIDE 111

Forest is your friend. Save the forest. Thank you!

slide-112
SLIDE 112

Forest Algorithms

Global Feature - RightBranch

66

  • length of rightmost (non-punctuation) path
  • English has a right-branching tendency

(Charniak and Johnson, 2005)

cannot be factored anywhere; have to wait till the root (whether a token is punctuation is itself ambiguous: ’ can be a possessive marker or a right quote)