Forest-Based Search Algorithms for Parsing and Machine Translation

Forest-Based Search Algorithms for Parsing and Machine Translation. Liang Huang, University of Pennsylvania. Google Research, March 14th, 2008. Search in NLP is not trivial: "I saw her duck." (Aravind Joshi)


  1. Outline
  • Packed Forests and Hypergraph Framework
  • Exact k-best Search in the Forest (Solution 1)
  • Approximate Joint Search with Non-Local Features (Solution 2)
    • Forest Reranking
    • Machine Translation
      • Decoding w/ Language Models
      • Forest Rescoring
  • Future Directions
  [parse forest of "I saw the boy with a telescope"; bigram items over VP3,6 and PP1,3, e.g. "held ... talk", "with ... Sharon"]

  2. Why is n-best reranking bad?
  • too few variations (limited scope)
    • 41% of correct parses are not in the ~30-best list (Collins, 2000)
    • worse for longer sentences
  • too many redundancies
    • a 50-best list usually encodes only 5-6 binary decisions (2^5 < 50 < 2^6)

  3. Reranking on a Forest?
  • with only local features: dynamic programming, tractable (Taskar et al., 2004; McDonald et al., 2005)
  • with non-local features: on-the-fly reranking at internal nodes
    • keep the top k derivations at each node
    • use as many non-local features as possible at each node
  • chart parsing + discriminative reranking
    • we use the perceptron for simplicity

  4. Generic Reranking by Perceptron
  • for each sentence s_i, we have a set of candidates cand(s_i)
  • and an oracle tree y_i+ among the candidates
  • a feature mapping from a tree y to a vector f(y)
  [diagram: the argmax over cand(s_i) is the "decoder"; f is the feature representation]  (Collins, 2002)
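This reranking loop can be sketched with a plain structured perceptron over sparse feature dictionaries. The function name, dict-based encoding, and toy data below are my own illustrative choices, not the talk's implementation.

```python
# A minimal sketch of perceptron reranking: for each sentence, score all
# candidates under the current weights, and if the top candidate is not the
# oracle, reward the oracle's features and penalize the prediction's.

def dot(w, f):
    """Inner product of a weight dict and a sparse feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def perceptron_rerank(train, n_iters=5):
    """train: list of (candidates, oracle_index), where each candidate is a
    sparse feature dict f(y). Returns the learned weight dict."""
    w = {}
    for _ in range(n_iters):
        for candidates, oracle_idx in train:
            # "decoder": pick the highest-scoring candidate under current w
            pred = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
            if pred != oracle_idx:
                for k, v in candidates[oracle_idx].items():
                    w[k] = w.get(k, 0.0) + v      # promote the oracle tree
                for k, v in candidates[pred].items():
                    w[k] = w.get(k, 0.0) - v      # demote the wrong guess
    return w
```

In forest reranking the "candidates" are derivations found by search in the forest rather than a fixed n-best list, but the update rule is the same.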

  5. Features
  • a feature f is a function from a tree y to a real number
  • f_1(y) = log Pr(y) is the log probability from the generative parser
  • every other feature counts the number of times a particular configuration occurs in y
  • our features are from (Charniak & Johnson, 2005) and (Collins, 2000)
  [parse tree of "I saw the boy with a telescope"; instances of the Rule feature: f_100(y) = f_{S → NP VP .}(y) = 1, f_200(y) = f_{NP → DT NN}(y) = 2]
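Count-based features like Rule can be sketched as a recursive tally over a nested tree; the tuple encoding and function name below are illustrative assumptions, not from the talk.

```python
# A toy sketch of Rule-feature extraction: trees are (label, children)
# tuples, where children are either subtrees or raw words at the leaves.
# Each CFG production found in the tree increments one feature count.

def rule_counts(tree, counts=None):
    """Count every production "X -> Y Z ..." occurring in the tree."""
    if counts is None:
        counts = {}
    label, children = tree
    # only internal nodes (children are subtrees, not words) yield rules
    if children and isinstance(children[0], tuple):
        rule = label + " -> " + " ".join(c[0] for c in children)
        counts[rule] = counts.get(rule, 0) + 1
        for c in children:
            rule_counts(c, counts)
    return counts

tree = ("S", [("NP", [("PRP", ["I"])]),
              ("VP", [("VBD", ["saw"]),
                      ("NP", [("DT", ["the"]), ("NN", ["boy"])])])])
feats = rule_counts(tree)   # e.g. feats["NP -> DT NN"] counts NP -> DT NN
```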

  6. Local vs. Non-Local Features
  • a feature is local iff it can be factored among local productions of a tree (i.e., hyperedges in a forest)
  • local features can be pre-computed on each hyperedge in the forest; non-local ones cannot
  [parse tree of "I saw the boy with a telescope": Rule is local, ParentRule is non-local]
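One way to sketch this distinction (class and field names here are illustrative, not the talk's code) is to attach precomputed local feature vectors to each hyperedge of the packed forest, while nodes simply collect their alternative incoming hyperedges:

```python
# A sketch of a packed forest: a node packs ambiguity via multiple incoming
# hyperedges, and local features (e.g. Rule) live on the hyperedge itself,
# so they can be computed once before any search. Non-local features
# (e.g. ParentRule) span more than one hyperedge and cannot be stored here.

class Hyperedge:
    def __init__(self, head, tails, rule):
        self.head = head          # parent node label, e.g. "S"
        self.tails = tails        # child labels, e.g. ["NP", "VP", "."]
        self.rule = rule          # the CFG production as a string
        # local features are precomputable per hyperedge
        self.local_features = {f"Rule:{rule}": 1.0}

class Node:
    def __init__(self, label, span):
        self.label = label
        self.span = span          # (i, j) indices into the sentence
        self.incoming = []        # alternative hyperedges = packed ambiguity

node_s = Node("S", (0, 8))
e = Hyperedge("S", ["NP", "VP", "."], "S -> NP VP .")
node_s.incoming.append(e)
```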

  7. WordEdges (C&J 05)
  • a WordEdges feature classifies a node by its label, (binned) span length, and surrounding words; WordEdges is local
    e.g. f_400(y) = f_{NP 2 saw with}(y) = 1 (a 2-word NP between "saw" and "with")
  • a POSEdges feature uses surrounding POS tags instead; POSEdges is non-local
    e.g. f_800(y) = f_{NP 2 VBD IN}(y) = 1
  • local features comprise ~70% of all feature instances!
  [parse tree of "I saw the boy with a telescope", highlighting the 2-word NP "the boy"]

  12. Factorizing non-local features
  • going bottom-up, at each node:
    • compute (partial values of) feature instances that become computable at this level
    • postpone those not yet computable to ancestors
  [parse tree: a unit instance of the ParentRule feature at the TOP node]

  15. NGramTree (C&J 05)
  • an NGramTree captures the smallest tree fragment that contains a bigram (two consecutive words)
  • unit instances are the boundary words between subtrees: for A_{i,k} → B_{i,j} C_{j,k}, the pair (w_{j-1}, w_j) is the unit instance at node A
  [parse tree of "I saw the boy with a telescope"]

  25. Heads (C&J 05, Collins 00)
  • head-to-head lexical dependencies
  • we percolate heads bottom-up
  • unit instances are between the head word of the head child and the head words of the non-head children
    e.g. unit instances at the VP node: saw-the; saw-with
  [head-annotated parse tree: TOP/saw, S/saw, NP/I, VP/saw, ./., VBD/saw, NP/the, PP/with, ...]

  28. Approximate Decoding
  • bottom-up, keep the top k derivations at each node
  • the grid for a hyperedge A_{i,k} → B_{i,j} C_{j,k} is non-monotonic due to non-local feature costs (e.g. w · f_N(...) = 0.5):

                 C: 1.0       C: 3.0        C: 8.0
    B: 1.0     2.0 + 0.5    4.0 + 5.0     9.0 + 0.5
    B: 1.1     2.1 + 0.3    4.1 + 5.4     9.1 + 0.3
    B: 3.5     4.5 + 0.6    6.5 + 10.5    11.5 + 0.6

    which sums to the non-monotonic grid

                 C: 1.0    C: 3.0    C: 8.0
    B: 1.0        2.5       9.0       9.5
    B: 1.1        2.4       9.5       9.4
    B: 3.5        5.1      17.0      12.1

  • use a priority queue for the next-best combination
    • each iteration pops the best cell and pushes its successors
    • extract unit non-local features on the fly
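The priority-queue search over the grid can be sketched as follows; the function name and dict-based combo costs are illustrative, with the 3x3 costs taken from the slide's grid. Note the non-monotonic pop order (2.5 pops before the cheaper 2.4), which is why the enumeration is approximate once non-local features are involved.

```python
# A sketch of best-first grid exploration: cell (i, j) pairs the i-th best
# derivation of B with the j-th best of C; a heap pops the cheapest cell
# and pushes its right and down neighbors.
import heapq

def k_best_combinations(b_costs, c_costs, combo, k):
    """b_costs, c_costs: sorted subderivation costs (lower is better).
    combo(i, j): extra cost from non-local features for pairing i with j.
    Returns up to k total costs in pop order (not necessarily sorted)."""
    heap = [(b_costs[0] + c_costs[0] + combo(0, 0), 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        cost, i, j = heapq.heappop(heap)
        out.append(cost)
        for ni, nj in ((i + 1, j), (i, j + 1)):   # grid successors
            if ni < len(b_costs) and nj < len(c_costs) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(
                    heap, (b_costs[ni] + c_costs[nj] + combo(ni, nj), ni, nj))
    return out

# the 3x3 grid from the slide: B costs, C costs, and per-cell combo costs
combos = {(0, 0): 0.5, (0, 1): 5.0, (0, 2): 0.5,
          (1, 0): 0.3, (1, 1): 5.4, (1, 2): 0.3,
          (2, 0): 0.6, (2, 1): 10.5, (2, 2): 0.6}
top = k_best_combinations([1.0, 1.1, 3.5], [1.0, 3.0, 8.0],
                          lambda i, j: combos[(i, j)], 4)
```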

  32. Algorithm 2 => Cube Pruning
  • process all hyperedges simultaneously!  significant savings of computation
  [diagram: competing VP hyperedges (PP1,3 VP3,6), (PP1,4 VP4,6), (NP1,2 VP2,3 PP3,6)]
  • bottleneck: the time for on-the-fly non-local feature extraction

  33. Forest vs. n-best Oracles
  • on top of the Charniak parser (modified to dump forests)
  • forests enjoy higher oracle scores than n-best lists, with much smaller sizes
  [chart: oracle F1 scores 98.6, 97.8, 97.2, 96.8]

  34. Main Results
  • pre-comp. is for feature extraction (can be parallelized)
  • # of training iterations is determined on the dev set
  • forest reranking outperforms both 50- and 100-best reranking

    baseline: 1-best Charniak parser                        F1 89.72

    features   n or k    pre-comp.      training    F1 %
    local      n = 50    1.4G / 25h     1 x 0.3h    91.01
    all        n = 50    2.4G / 34h     5 x 0.5h    91.43
    all        n = 100   5.3G / 77h     5 x 1.3h    91.47
    local      -         1.2G / 5.1h    3 x 1.4h    91.25
    all        k = 15    1.2G / 5.1h    4 x 11h     91.69

  35. Comparison with Others

    type  system                         F1 %
    D     Collins (2000)                 89.7
    D     Henderson (2004)               90.1
    D     Charniak and Johnson (2005)    91.0
    D       updated (2006)               91.4
    D     Petrov and Klein (2008)        88.3
    D     this work                      91.7
    G     Bod (2000)                     90.7
    G     Petrov and Klein (2007)        90.1
    S     McClosky et al. (2006)         92.1

  • best accuracy to date on the Penn Treebank

  36. Outline
  • Packed Forests and Hypergraph Framework
  • Exact k-best Search in the Forest
  • Approximate Joint Search with Non-Local Features
    • Forest Reranking
    • Machine Translation
      • Decoding w/ Language Models
      • Forest Rescoring
  • Future Directions
  [parse forest of "I saw the boy with a telescope"; bigram items over VP3,6 and PP1,3]

  37. Statistical Machine Translation
  • noisy-channel pipeline: statistical analysis of Spanish/English bilingual text yields the translation model (TM); statistical analysis of English text yields the language model (LM)
  • the TM provides competency ("Broken English"); the LM provides fluency
  • "Que hambre tengo yo" → TM candidates: "What hunger have I", "Hungry I am so", "Have I that hunger", "How hunger have I", ...
  • the LM then picks "I am so hungry": this k-best rescoring is Algorithm 3  (Knight and Koehn, 2003)

  39. Statistical Machine Translation
  • phrase-based or syntax-based TM + n-gram LM, combined in an integrated (LM-integrated) decoder
  • "Que hambre tengo yo" → integrated decoder → "I am so hungry"
  • computationally challenging! ☹

  40. Forest Rescoring
  • the (phrase-based or syntax-based) decoder first produces a packed forest
  • the n-gram LM is then added as non-local information by a forest rescorer
  • "Que hambre tengo yo" → decoder → packed forest → forest rescorer → "I am so hungry"

  41. Syntax-based Translation
  • synchronous context-free grammars (SCFGs)
  • a context-free grammar in two dimensions
  • generating pairs of strings/trees simultaneously
  • co-indexed nonterminals are further rewritten as a unit
      VP → PP(1) VP(2) , VP(2) PP(1)
      VP → juxing le huitan , held a meeting
      PP → yu Shalong , with Sharon
  [paired derivation trees: "yu Shalong juxing le huitan" ↔ "held a meeting with Sharon"]

  42. Translation as Parsing
  • translation with SCFGs => monolingual parsing
  • parse the source input with the source projection of the grammar
  • build the corresponding target substrings in parallel
      VP → PP(1) VP(2) , VP(2) PP(1)
      VP → juxing le huitan , held a talk
      PP → yu Shalong , with Sharon
  [source parse VP1,6 → PP1,3 VP3,6 over "yu Shalong juxing le huitan"; target sides "with Sharon" and "held a talk" combine to "held a talk with Sharon"]
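A toy sketch of this parse-then-build process, assuming a hypothetical rule encoding in which integers co-index the nonterminals on the target side; the `realize` helper and dict layout are mine, not the talk's.

```python
# SCFG translation as parsing: look up the rule by source right-hand side,
# then emit the target right-hand side, substituting each co-indexed
# nonterminal with the target string already built for that subtree.

RULES = {
    # VP -> PP(1) VP(2) , VP(2) PP(1): reorder PP after VP on the target side
    ("VP", ("PP", "VP")): (1, 0),
    ("VP", ("juxing", "le", "huitan")): ("held", "a", "talk"),
    ("PP", ("yu", "Shalong")): ("with", "Sharon"),
}

def realize(label, children):
    """children: subtrees as (label, target_string) pairs, or raw source
    words for a lexical rule. Returns the target string for this node."""
    key = (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
    out = []
    for t in RULES[key]:
        if isinstance(t, int):
            out.append(children[t][1])   # co-indexed nonterminal: reordered
        else:
            out.append(t)                # target terminal
    return " ".join(out)

pp = ("PP", realize("PP", ["yu", "Shalong"]))          # "with Sharon"
vp = ("VP", realize("VP", ["juxing", "le", "huitan"])) # "held a talk"
translation = realize("VP", [pp, vp])                  # reordering rule
```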

  44. Adding a Bigram Model
  • exact dynamic programming: nodes are split into +LM items carrying English boundary words
    e.g. (S1,6: held ... Sharon) from the bigram combination of (VP3,6: held ... talk) and (PP1,3: with ... Sharon)
  • but the search space is too big for exact search
  • beam search: keep at most k +LM items at each node
  • but can we do better?
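The +LM item idea can be sketched as follows, with a hypothetical bigram cost table and illustrative numbers; the point is that only the boundary words of an item's target string matter for further combinations, so items sharing them collapse into one state.

```python
# A sketch of bigram-integrated (+LM) items: each item records its cost and
# the first/last words of its target string. Combining two items scores the
# one new bigram at the seam between them.

BIGRAM_COST = {("talk", "with"): 0.5}   # -log P values, purely illustrative

def lm_cost(left_word, right_word):
    """Cost of the bigram (left_word, right_word); large penalty if unseen."""
    return BIGRAM_COST.get((left_word, right_word), 5.0)

def combine(item1, item2):
    """item: (cost, first_word, last_word); concatenate item1 then item2,
    paying for the bigram that appears at the boundary."""
    c1, f1, l1 = item1
    c2, f2, l2 = item2
    return (c1 + c2 + lm_cost(l1, f2), f1, l2)

vp_item = (1.0, "held", "talk")     # VP3,6: "held a talk"
pp_item = (2.0, "with", "Sharon")   # PP1,3: "with Sharon"
s_item = combine(vp_item, pp_item)  # scores the bigram ("talk", "with")
```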

  45. Non-Monotonic Grid
  • combining VP1,6 → VP3,6 PP1,3: non-monotonicity is due to LM combo costs, e.g. the bigram (meeting, with)

                                      (PP with...Sharon)1,3  (PP along...Sharon)1,3  (PP with...Shalong)1,3
                                              1.0                     3.0                     8.0
    (VP held...meeting)3,6       1.0       2.0 + 0.5               4.0 + 5.0               9.0 + 0.5
    (VP held...talk)3,6          1.1       2.1 + 0.3               4.1 + 5.4               9.1 + 0.3
    (VP hold...conference)3,6    3.5       4.5 + 0.6               6.5 + 10.5              11.5 + 0.6

  47. Algorithm 2 - Cube Pruning
  • the summed grid, explored best-first:

     2.5     9.0     9.5
     2.4     9.5     9.4
     5.1    17.0    12.1

  48. Algorithm 2 => Cube Pruning
  • k-best Algorithm 2, with search errors
  • process all hyperedges simultaneously!  significant savings of computation
  [diagram: competing VP hyperedges (PP1,3 VP3,6), (PP1,4 VP4,6), (NP1,4 VP4,6)]

  49. Phrase-based: Translation Accuracy
  [chart: speed++ vs. quality++; Algorithm 2 is ~100 times faster]

  50. Syntax-based: Translation Accuracy
  [chart: speed++ vs. quality++; Algorithm 2 and Algorithm 3]

  51. Conclusion so far
  • General framework of DP on hypergraphs
    • monotonicity => exact 1-best algorithm
  • Exact k-best algorithms
  • Approximate search with non-local information
    • Forest Reranking for discriminative parsing
    • Forest Rescoring for MT decoding
  • Empirical Results
    • orders of magnitude faster than previous methods
    • best Treebank parsing accuracy to date

  52. Impact
  • These algorithms have been widely implemented in state-of-the-art parsers
    • Charniak parser
    • McDonald's dependency parser
    • MIT parser (Collins/Koo), Berkeley and Stanford parsers
    • DOP parsers (Bod, 2006/7)
  • and in major statistical MT systems
    • syntax-based systems from ISI, CMU, BBN, ...
    • phrase-based system: Moses [underway]

  53. Future Directions

  54. Further work on Forest Reranking
  • Better Decoding Algorithms
    • pre-compute most non-local features
    • use Algorithm 3 (cube growing)
    • intra-sentence parallelized decoding
  • Combination with Semi-supervised Learning
    • easy to apply to self-training (McClosky et al., 2006)
  • Deeper and deeper decoding (e.g., semantic roles)
  • Other Machine Learning Algorithms
  • Theoretical and Empirical Analysis of Search Errors

  55. Machine Translation / Generation
  • Discriminative training using non-local features
    • local features showed modest improvement on phrase-based systems (Liang et al., 2006)
    • plan for syntax-based (tree-to-string) systems: fast, linear-time decoding
  • Using packed parse forests for
    • tree-to-string decoding (Mi, Huang, Liu, 2008)
    • rule extraction (tree-to-tree)
  • Generation / Summarization: non-local constraints

  56. Thanks! THE END. Questions? Comments?

  57. Speed vs. Search Quality
  • tested on our faithful clone of Pharaoh, with the same parameters
  [chart: speed++ vs. quality++ (-log Prob); Huang and Chiang's Forest Rescoring is 32 times faster]

  60. Syntax-based: Search Quality
  [chart: speed++ vs. quality++ (-log Prob); 10 times faster]

  61. Tree-to-String System
  • syntax-directed, English to Chinese (Huang, Knight, Joshi, 2006)
  • first parse the input, then recursively transfer
  • synchronous tree-substitution grammars (STSG) (Galley et al., 2004; Eisner, 2003)
  • extended to translate a packed forest instead of a single tree (Mi, Huang, Liu, 2008)
  [parse tree of "was shot to death by the police"]
