

SLIDE 1

Forest-Based Search Algorithms

for Parsing and Machine Translation

Liang Huang

University of Pennsylvania

Google Research, March 14th, 2008

SLIDE 2

Search in NLP

  • is not trivial!

Aravind Joshi

I saw her duck.

SLIDE 4

Search in NLP

  • is not trivial!

Aravind Joshi

I eat sushi with tuna.

SLIDE 6

I saw her duck.

  • how about...
  • I saw her duck with a telescope.
  • I saw her duck with a telescope in the garden...

SLIDE 11

Parsing/NLP is HARD!

  • exponential explosion of the search space
  • solution: locally factored space => packed forest
  • efficient algorithms based on dynamic programming
  • non-local dependencies
  • solution: ???

[figure: parse tree of "I saw her duck with a telescope" (cf. "eat sushi with tuna")]

SLIDE 13

Key Problem

  • How to efficiently incorporate non-local information?
  • Solution 1: pipelined reranking / rescoring
  • postpone disambiguation by propagating k-best lists
  • examples: tagging => parsing => semantics
  • need very efficient algorithms for k-best search
  • Solution 2: joint approximate search
  • integrate non-local information in the search
  • intractable; so only approximately
  • largely open

SLIDE 17

Outline

  • Packed Forests and Hypergraph Framework
  • Exact k-best Search in the Forest (for Solution 1)
  • Approximate Joint Search with Non-Local Features (Solution 2)
  • Forest Reranking
  • Machine Translation: Decoding w/ Language Models
  • Forest Rescoring
  • Future Directions

[figures: parse tree of "I saw the boy with a telescope."; +LM items VP3,6 "held ... talk" and PP1,3 "with ... Sharon" combined via a bigram]

SLIDE 18

Packed Forests and Hypergraph Framework

SLIDE 19

Packed Forests

  • a compact representation of many parses
  • by sharing common sub-derivations
  • polynomial-space encoding of an exponentially large set

(Klein and Manning, 2001; Huang and Chiang, 2005)

[figure: shared chart over "0 I 1 saw 2 him 3 with 4 a 5 mirror 6" -- the nodes and hyperedges form a hypergraph]

SLIDE 21

Lattices vs. Forests

  • forest generalizes "lattice" from the finite-state world
  • both are compact encodings of exponentially many derivations (paths or trees)
  • graph => hypergraph; regular grammar => CFG

SLIDE 22

Weight Functions

  • Each hyperedge e has a weight function fe
  • monotonic in each argument: if b′ ≤ b then fe(b′, c) ≤ fe(b, c)
  • e.g. in CKY, fe(a, b) = a × b × Pr(rule)
  • optimal subproblem property in dynamic programming
  • optimal solutions include optimal sub-solutions

update along a hyperedge: d(v) = d(v) ⊕ fe(d(u))

SLIDE 23

Generalized Viterbi Algorithm

  • 1. topological sort (assumes acyclicity)
  • 2. visit each node v in sorted order and do updates
  • for each incoming hyperedge e = ((u1, ..., u|e|), v, fe)
  • use the d(ui)'s to update d(v): d(v) ⊕= fe(d(u1), ..., d(u|e|))
  • key observation: the d(ui)'s are already fixed to optimal at this time
  • time complexity: O(V+E) = O(E); for CKY: O(n³)

SLIDE 24

1-best => k-best

  • we need k-best for pipelined reranking / rescoring
  • since the 1-best is not guaranteed to be correct
  • rerank the k-best list with non-local features
  • we need fast algorithms for very big values of k

[figure: 1-best parse of "I eat sushi with tuna." from the Charniak parser]

SLIDE 25

k-best Viterbi Algorithm 0

  • straightforward k-best extension
  • a vector of k (sorted) values for each node
  • now what's the result of fe(a, b)?
  • k × k = k² possibilities! => then choose the top k
  • time complexity: O(k²E)

[figure: hyperedge e = ((u1, u2), v, fe) with sorted k-best vectors a and b at u1 and u2]

SLIDE 26

k-best Viterbi Algorithm 1

  • key insight: no need to enumerate all k² combinations
  • since the vectors a and b are sorted
  • and the weight function fe is monotonic
  • (a1, b1) must be the best
  • either (a2, b1) or (a1, b2) is the 2nd-best
  • use a priority queue for the frontier
  • extract the best
  • push its two successors
  • time complexity: O(E k log k)

[figure: the k × k grid over sorted vectors a and b; popped cells and their successors form the frontier]
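A sketch of this frontier merge for a single binary hyperedge (illustrative Python; kbest_merge and the cost-style combination are my naming, not the talk's):

```python
import heapq

def kbest_merge(a, b, f, k):
    """Top-k values of f(a[i], b[j]) without enumerating all k² pairs.

    Assumes a and b are sorted best-first and f is monotonic in each
    argument, so cell (0, 0) is the best and every cell dominates its
    right and down successors in the grid."""
    frontier = [(f(a[0], b[0]), 0, 0)]        # min-heap on cost
    seen, out = {(0, 0)}, []
    while frontier and len(out) < k:
        cost, i, j = heapq.heappop(frontier)  # extract best
        out.append(cost)
        for i2, j2 in ((i + 1, j), (i, j + 1)):   # push two successors
            if i2 < len(a) and j2 < len(b) and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(frontier, (f(a[i2], b[j2]), i2, j2))
    return out

# e.g. costs (negative log probabilities) combine by addition
print(kbest_merge([1.0, 1.1, 3.5], [1.0, 3.0, 8.0], lambda x, y: x + y, 4))
# -> [2.0, 2.1, 4.0, 4.1]
```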

SLIDE 30

k-best Viterbi Algorithm 2

  • Algorithm 1 works on each hyperedge sequentially
  • O(E k log k) is still too slow for big k
  • Algorithm 2 processes all hyperedges of a node in parallel
  • dramatic speed-up: O(E + V k log k)

[figure: node VP1,6 with several incoming hyperedges over PP1,3, VP3,6, PP1,4, VP4,6, PP3,6, NP1,2, VP2,3 -- locally Dijkstra, globally Viterbi]
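A sketch of the node-level variant, generalizing kbest_merge above from one hyperedge to all incoming hyperedges of a node competing in one queue (same assumptions; binary hyperedges only):

```python
import heapq

def kbest_node(incoming, k):
    """k-best at one node; incoming = [(a, b, f), ...], one per hyperedge.

    Seeding the heap with the (0, 0) cell of every incoming hyperedge
    lets all hyperedges compete in one queue: roughly
    O(|incoming| + k log k) heap work here, instead of a full
    k-best merge per hyperedge."""
    heap, seen, out = [], set(), []
    for e, (a, b, f) in enumerate(incoming):
        heapq.heappush(heap, (f(a[0], b[0]), e, 0, 0))
        seen.add((e, 0, 0))
    while heap and len(out) < k:
        cost, e, i, j = heapq.heappop(heap)
        out.append(cost)
        a, b, f = incoming[e]
        for i2, j2 in ((i + 1, j), (i, j + 1)):
            if i2 < len(a) and j2 < len(b) and (e, i2, j2) not in seen:
                seen.add((e, i2, j2))
                heapq.heappush(heap, (f(a[i2], b[j2]), e, i2, j2))
    return out
```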

SLIDE 32

k-best Viterbi Algorithm 3

  • Algorithm 2 computes the k-best for each node
  • but we are only interested in the k-best of the root node
  • Algorithm 3 computes only as many as really needed
  • forward phase
  • same as 1-best Viterbi, but stores the forest (keeping alternative hyperedges)
  • backward phase
  • recursively asks "what's your 2nd-best?" top-down
  • asks for more only when more is needed

SLIDE 33

k-best Viterbi Algorithm 3

  • only the 1-best is known after the forward phase
  • recursive backward phase

[figure: root S1,9 with incoming hyperedges such as (NP1,3 VP3,9), (NP1,5 VP5,9), (S1,5 PP5,9); the "what's your 2nd-best?" question recurses top-down, e.g. to NN1,2 PP2,9 and VB5,6 PP6,9]
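A compact lazy sketch of the backward phase (illustrative Python; the toy forest encoding is mine, and unlike the real Algorithm 3 it returns only costs, not derivations):

```python
import heapq

# toy forest from the forward pass: node -> [(tails, cost), ...];
# leaves are encoded as hyperedges with empty tails
forest = {
    "S":  [(("NP", "VP"), 1.0), (("S2", "PP"), 1.5)],
    "NP": [((), 0.5), ((), 0.9)],
    "VP": [((), 0.4), ((), 0.7)],
    "S2": [((), 2.0)],
    "PP": [((), 0.3)],
}

def jth_best(node, j, cache={}):
    """Cost of the j-th best derivation at node (0-indexed), or None.

    Lazy: asking the root for its 2nd-best triggers "what's your
    2nd-best?" questions only along the paths that need them.
    (The shared default cache is fine for this one-forest sketch.)"""
    if node not in cache:
        cand = [(c + sum(jth_best(t, 0) for t in tails), e, (0,) * len(tails))
                for e, (tails, c) in enumerate(forest[node])]
        heapq.heapify(cand)
        cache[node] = (cand, [], {(e, js) for _, e, js in cand})
    cand, found, seen = cache[node]
    while len(found) <= j and cand:
        cost, e, js = heapq.heappop(cand)
        found.append(cost)
        tails, c = forest[node][e]
        for i in range(len(tails)):            # successors: bump one child
            js2 = js[:i] + (js[i] + 1,) + js[i + 1:]
            if (e, js2) in seen:
                continue
            kids = [jth_best(t, js2[ti]) for ti, t in enumerate(tails)]
            if None not in kids:
                seen.add((e, js2))
                heapq.heappush(cand, (c + sum(kids), e, js2))
    return found[j] if j < len(found) else None

print(jth_best("S", 0), jth_best("S", 1))  # 1.9 2.2
```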

SLIDE 37

Summary of Algorithms

  • Algorithms 1 => 2 => 3
  • lazier and lazier (computation on demand)
  • larger and larger locality
  • Algorithm 3 is very fast, but requires storing the forest

                locality     time                space
  Algorithm 1   hyperedge    O(E k log k)        O(k V)
  Algorithm 2   node         O(E + V k log k)    O(k V)
  Algorithm 3   global       O(E + D k log k)    O(E + k D)

E - hyperedges: O(n³); V - nodes: O(n²); D - derivation: O(n)

SLIDE 38

Experiments - Efficiency

  • on the state-of-the-art Collins/Bikel parser (Bikel, 2004)
  • average parsing time per sentence using Algorithms 0, 1, and 3

[figure: average parsing time vs. k; Algorithm 3 scales as O(E + D k log k)]

SLIDE 39

Reranking and Oracles

  • oracle - the candidate closest to the correct parse among the k-best candidates
  • measures the potential of real reranking

[figure: oracle Parseval score vs. k, our algorithms compared with Collins (2000)]

SLIDE 40

Outline

  • Packed Forests and Hypergraph Framework
  • Exact k-best Search in the Forest (Solution 1)
  • Approximate Joint Search with Non-Local Features (Solution 2)
  • Forest Reranking
  • Machine Translation: Decoding w/ Language Models
  • Forest Rescoring
  • Future Directions

[figures: parse tree of "I saw the boy with a telescope."; +LM items VP3,6 "held ... talk" and PP1,3 "with ... Sharon" combined via a bigram]

SLIDE 41

Why is n-best reranking bad?

  • too few variations (limited scope)
  • 41% of correct parses are not in the ~30-best (Collins, 2000)
  • worse for longer sentences
  • too many redundancies
  • a 50-best list usually encodes only 5-6 binary decisions (2⁵ < 50 < 2⁶)

SLIDE 42

Reranking on a Forest?

  • with only local features
  • dynamic programming, tractable (Taskar et al., 2004; McDonald et al., 2005)
  • with non-local features
  • on-the-fly reranking at internal nodes
  • top k derivations at each node
  • use as many non-local features as possible at each node
  • chart parsing + discriminative reranking
  • we use the perceptron for simplicity

SLIDE 43

Generic Reranking by Perceptron (Collins, 2002)

  • for each sentence si, we have a set of candidates cand(si)
  • and an oracle tree yi+ among the candidates
  • a feature mapping from tree y to vector f(y)

[figure: the perceptron loop -- a "decoder" picks the top-scoring candidate under the current weights, using the feature representation f]
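A minimal perceptron reranker sketch over candidates represented as feature dicts (illustrative Python; a real implementation would use the averaged perceptron):

```python
def perceptron_rerank(train, n_iters=5):
    """Plain (non-averaged) perceptron reranker.

    train: list of (candidates, oracle_index); each candidate is a
    sparse feature dict {feature_name: value}, i.e. f(y)."""
    w = {}
    score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
    for _ in range(n_iters):
        for cands, oracle in train:
            best = max(range(len(cands)), key=lambda i: score(cands[i]))
            if best != oracle:  # mistake: move toward oracle, away from best
                for k, v in cands[oracle].items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in cands[best].items():
                    w[k] = w.get(k, 0.0) - v
    return w

# toy: one sentence, two candidate trees, oracle is the second
train = [([{"Rule:S->NP VP": 1, "logp": -2.0},
           {"Rule:S->VP": 1, "logp": -1.5}], 1)]
w = perceptron_rerank(train)   # after training, candidate 1 scores highest
```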

SLIDE 44

Features

  • a feature f is a function from tree y to a real number
  • f1(y) = log Pr(y) is the log probability from the generative parser
  • every other feature counts the number of times a particular configuration occurs in y
  • our features are from (Charniak and Johnson, 2005) and (Collins, 2000)

instances of the Rule feature: f100(y) = f⟨S → NP VP .⟩(y) = 1, f200(y) = f⟨NP → DT NN⟩(y) = 2

[figure: parse tree of "I saw the boy with a telescope."]

SLIDE 45

Local vs. Non-Local Features

  • a feature is local iff it can be factored among the local productions of a tree (i.e., hyperedges in a forest)
  • local features can be pre-computed on each hyperedge in the forest; non-local ones cannot

Rule is local; ParentRule is non-local

[figure: parse tree of "I saw the boy with a telescope."]

SLIDE 46

WordEdges (C&J 05)

  • a WordEdges feature classifies a node by its label, (binned) span length, and surrounding words
  • a POSEdges feature uses surrounding POS tags instead
  • WordEdges is local: f400(y) = f⟨NP, 2, saw, with⟩(y) = 1 (an NP spanning 2 words, between "saw" and "with")
  • POSEdges is non-local: f800(y) = f⟨NP, 2, VBD, IN⟩(y) = 1
  • local features comprise ~70% of all instances!

[figure: parse tree of "I saw the boy with a telescope."; the NP "the boy" spans 2 words]

SLIDE 51

Factorizing non-local features

  • going bottom-up, at each node
  • compute (partial values of) feature instances that become computable at this level
  • postpone those not yet computable to the ancestors

[figure: parse tree of "I saw the boy with a telescope."; a unit instance of the ParentRule feature fires at the TOP node]
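A toy sketch of this postponement for the ParentRule feature on a single tree rather than a forest (illustrative Python; the tree encoding and instance format are mine):

```python
def parent_rule_instances(tree, parent_label=None):
    """Emit ParentRule instances: a node's own rule becomes a full
    instance only once its parent's label is known, so each instance
    is 'postponed' to the parent level.

    tree: (label, [children]); leaves have empty child lists."""
    label, children = tree
    out = []
    if children:
        rule = f"{label} -> " + " ".join(c[0] for c in children)
        if parent_label is not None:
            out.append(f"{parent_label} ^ {rule}")   # computable here
        for c in children:
            out.extend(parent_rule_instances(c, label))
    return out

tree = ("TOP", [("S", [("NP", [("PRP", [])]), ("VP", [("VBD", [])])])])
print(parent_rule_instances(tree))
# ['TOP ^ S -> NP VP', 'S ^ NP -> PRP', 'S ^ VP -> VBD']
```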

SLIDE 54

NGramTree (C&J 05)

  • an NGramTree captures the smallest tree fragment that contains a bigram (two consecutive words)
  • unit instances are boundary words between subtrees

[figure: node Ai,k with children Bi,j over wi ... wj−1 and Cj,k over wj ... wk−1; the unit instance at node A covers the boundary bigram (wj−1, wj)]

SLIDE 64

Heads (C&J 05, Collins 00)

  • head-to-head lexical dependencies
  • we percolate heads bottom-up
  • unit instances are between the head word of the head child and the head words of non-head children

[figure: head-annotated parse tree (TOP/saw, S/saw, NP/I, VP/saw, ...); unit instances at the VP node: saw - the; saw - with]

SLIDE 67

Approximate Decoding

  • bottom-up, keeps the top k derivations at each node
  • non-monotonic grid due to non-local features: a non-local feature contributes w·fN(·), e.g. 0.5, to a cell
  • priority queue for the next-best
  • each iteration pops the best and pushes its successors
  • extract unit non-local features on-the-fly

[figure: node Ai,k built from Bi,j (over wi ... wj−1) and Cj,k (over wj ... wk−1)]

the grid (base costs + non-local combo costs):

            1.0          3.0          8.0
    1.0   2.0 + 0.5    4.0 + 5.0    9.0 + 0.5
    1.1   2.1 + 0.3    4.1 + 5.4    9.1 + 0.3
    3.5   4.5 + 0.6    6.5 + 10.5   11.5 + 0.6

summed:

            1.0     3.0     8.0
    1.0     2.5     9.0     9.5
    1.1     2.4     9.5     9.4
    3.5     5.1    17.0    12.1

note the summed grid is non-monotonic: the best cell (2.4) is no longer in the top-left corner
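A sketch of this pop-and-rescore loop on the grid above (illustrative Python; nonlocal_cost stands in for the w·fN term):

```python
import heapq

def approx_kbest(a, b, nonlocal_cost, k):
    """Pop cells by optimistic local score a[i]+b[j], re-score each
    popped cell with its non-local cost, and keep the top k found.
    The re-scored grid is non-monotonic, so this can make search
    errors -- it is approximate, not exact, k-best."""
    frontier = [(a[0] + b[0], 0, 0)]
    seen, buf = {(0, 0)}, []
    while frontier and len(buf) < k:
        local, i, j = heapq.heappop(frontier)
        buf.append(local + nonlocal_cost(i, j))   # true score of the cell
        for i2, j2 in ((i + 1, j), (i, j + 1)):
            if i2 < len(a) and j2 < len(b) and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(frontier, (a[i2] + b[j2], i2, j2))
    return sorted(buf)[:k]

# the grid from this slide: rows 1.0 / 1.1 / 3.5, columns 1.0 / 3.0 / 8.0
combo = {(0, 0): 0.5, (0, 1): 5.0, (0, 2): 0.5,
         (1, 0): 0.3, (1, 1): 5.4, (1, 2): 0.3,
         (2, 0): 0.6, (2, 1): 10.5, (2, 2): 0.6}
print(approx_kbest([1.0, 1.1, 3.5], [1.0, 3.0, 8.0],
                   lambda i, j: combo[(i, j)], 3))
# -> [2.4, 2.5, 9.0]; the true 3rd-best is 5.1 at cell (3.5, 1.0): a search error
```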

SLIDE 71

Algorithm 2 => Cube Pruning

  • bottleneck: the time for on-the-fly non-local feature extraction
  • process all hyperedges simultaneously!
  • significant savings of computation

[figure: node VP with several incoming hyperedges over PP1,3, VP3,6, PP1,4, VP4,6, PP3,6, NP1,2, VP2,3]

SLIDE 72

Forest vs. n-best Oracles

  • on top of the Charniak parser (modified to dump the forest)
  • forests enjoy higher oracle scores than n-best lists
  • with much smaller sizes

[figure: oracle F1 scores (97.8, 96.8, 98.6, 97.2) for forests vs. n-best lists]

SLIDE 73

Main Results

baseline: 1-best Charniak parser, F1 = 89.72

  features        n or k     pre-comp.      training    F1%
  local           n = 50     1.4G / 25h     1 × 0.3h    91.01
  all             n = 50     2.4G / 34h     5 × 0.5h    91.43
  all             n = 100    5.3G / 77h     5 × 1.3h    91.47
  local (forest)  -          1.2G / 5.1h    3 × 1.4h    91.25
  all (forest)    k = 15     -              4 × 11h     91.69

  • pre-comp. is for feature extraction (can be parallelized)
  • the number of training iterations is determined on the dev set
  • forest reranking outperforms both 50-best and 100-best reranking
SLIDE 74

Comparison with Others

  type  system                         F1%
  D     Collins (2000)                 89.7
        Henderson (2004)               90.1
        Charniak and Johnson (2005)    91.0
          updated (2006)               91.4
        Petrov and Klein (2008)        88.3
        this work                      91.7
  G     Bod (2000)                     90.7
        Petrov and Klein (2007)        90.1
  S     McClosky et al. (2006)         92.1  (best accuracy to date on the Penn Treebank)

SLIDE 75

Outline

  • Packed Forests and Hypergraph Framework
  • Exact k-best Search in the Forest
  • Approximate Joint Search with Non-Local Features
  • Forest Reranking
  • Machine Translation: Decoding w/ Language Models
  • Forest Rescoring
  • Future Directions

[figures: parse tree of "I saw the boy with a telescope."; +LM items VP3,6 "held ... talk" and PP1,3 "with ... Sharon" combined via a bigram]

SLIDE 76

Statistical Machine Translation (Knight and Koehn, 2003)

  • translation model (TM): competency; language model (LM): fluency

[figure: noisy-channel pipeline, Spanish => broken English => English, with statistical analysis of Spanish/English bilingual text and of English text; "Que hambre tengo yo" yields candidates "What hunger have I", "Hungry I am so", "Have I that hunger", "I am so hungry", "How hunger have I", ...; the LM picks "I am so hungry"; k-best rescoring uses Algorithm 3]

SLIDE 78

Statistical Machine Translation

  • translation model (TM): competency; language model (LM): fluency
  • phrase-based or syntax-based TM + n-gram LM
  • an LM-integrated decoder is computationally challenging! ☹

[figure: integrated decoder translating "Que hambre tengo yo" into "I am so hungry"]

SLIDE 79

Forest Rescoring

  • phrase-based or syntax-based TM + n-gram LM
  • the decoder first builds a packed forest; a forest rescorer then integrates the LM as non-local information

[figure: decoder => packed forest => forest rescorer; "Que hambre tengo yo" => "I am so hungry"]

SLIDE 80

Syntax-based Translation

  • synchronous context-free grammars (SCFGs)
  • context-free grammar in two dimensions
  • generating pairs of strings/trees simultaneously
  • co-indexed nonterminals are further rewritten as a unit

  VP → PP(1) VP(2), VP(2) PP(1)
  VP → juxing le huitan, held a meeting
  PP → yu Shalong, with Sharon

[figure: paired source/target trees for "yu Shalong juxing le huitan" / "held a meeting with Sharon"]
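A toy sketch of the reordering these rules perform (illustrative Python; the (source, target) pair encoding is mine):

```python
# derivation: VP => PP(1) VP(2), with PP(1) => "yu Shalong"
# and VP(2) => "juxing le huitan"; the target side reorders to VP(2) PP(1)
pp = ("yu Shalong", "with Sharon")           # (source, target) of PP(1)
vp = ("juxing le huitan", "held a meeting")  # (source, target) of VP(2)

source = f"{pp[0]} {vp[0]}"   # "yu Shalong juxing le huitan"
target = f"{vp[1]} {pp[1]}"   # "held a meeting with Sharon"
print(source, "=>", target)
```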

SLIDE 81

Translation as Parsing

  • translation with SCFGs => monolingual parsing
  • parse the source input with the source projection
  • build the corresponding target sub-strings in parallel

[figure: chart over "yu Shalong juxing le huitan": PP1,3 => "with Sharon", VP3,6 => "held a talk", VP1,6 => "held a talk with Sharon"]

  VP → PP(1) VP(2), VP(2) PP(1)
  VP → juxing le huitan, held a meeting
  PP → yu Shalong, with Sharon

SLIDE 83

Adding a Bigram Model

  • exact dynamic programming
  • nodes now split into +LM items
  • with English boundary words
  • search space too big for exact search
  • beam search: keep at most k +LM items at each node
  • but can we do better?

[figure: +LM items (VP3,6 "held ... talk") and (PP1,3 "with ... Sharon") combine via a bigram into (S1,6 "held ... Sharon")]
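A sketch of combining two +LM items across the seam (illustrative Python; the bigram table and all costs are made up):

```python
import math

# a +LM item: (node, left boundary word, right boundary word, cost);
# bigram_logp is a hypothetical bigram table
bigram_logp = {("talk", "with"): math.log(0.04)}

def combine(item_b, item_c, rule_cost, bigram_logp):
    """Combine two +LM items in target order B . C into a new item.

    The only new LM score needed is the bigram across the seam,
    (B's right boundary word, C's left boundary word); the outer
    words become the boundaries of the new item."""
    node_b, bl, br, cost_b = item_b
    node_c, cl, cr, cost_c = item_c
    seam = -bigram_logp.get((br, cl), math.log(1e-6))  # negative log prob
    return ("S1,6", bl, cr, cost_b + cost_c + rule_cost + seam)

vp = ("VP3,6", "held", "talk", 2.0)     # "held ... talk"
pp = ("PP1,3", "with", "Sharon", 1.0)   # "with ... Sharon"
print(combine(vp, pp, 0.5, bigram_logp))
# -> ('S1,6', 'held', 'Sharon', ...)  i.e. "held ... Sharon"
```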
SLIDE 84

Non-Monotonic Grid

non-monotonicity due to LM combo costs: e.g. combining (VP3,6 "held ... meeting") with (PP1,3 "with ... Sharon") under target order VP PP incurs the cost of the bigram (meeting, with)

columns: (VP3,6 "held ... meeting"), (VP3,6 "held ... talk"), (VP3,6 "hold ... conference")
rows: (PP1,3 "with ... Sharon"), (PP1,3 "along ... Sharon"), (PP1,3 "with ... Shalong")

base costs + LM combo costs:

            1.0          3.0          8.0
    1.0   2.0 + 0.5    4.0 + 5.0    9.0 + 0.5
    1.1   2.1 + 0.3    4.1 + 5.4    9.1 + 0.3
    3.5   4.5 + 0.6    6.5 + 10.5   11.5 + 0.6

[figure: hyperedge PP1,3 + VP3,6 => VP1,6]

SLIDE 86

Algorithm 2 => Cube Pruning

the same grid with the costs summed:

            1.0     3.0     8.0
    1.0     2.5     9.0     9.5
    1.1     2.4     9.5     9.4
    3.5     5.1    17.0    12.1

columns: (VP3,6 "held ... meeting"), (VP3,6 "held ... talk"), (VP3,6 "hold ... conference")
rows: (PP1,3 "with ... Sharon"), (PP1,3 "along ... Sharon"), (PP1,3 "with ... Shalong")

[figure: hyperedge PP1,3 + VP3,6 => VP1,6]

SLIDE 87

Algorithm 2 => Cube Pruning

  • process all hyperedges simultaneously!
  • significant savings of computation
  • k-best Algorithm 2, with search errors

[figure: node VP with incoming hyperedges (PP1,3 VP3,6), (PP1,4 VP4,6), (NP1,4 VP4,6)]

SLIDE 88

Phrase-based: Translation Accuracy

[figure: translation quality vs. decoding speed; Algorithm 2 (cube pruning) reaches the same quality ~100 times faster]

SLIDE 89

Syntax-based: Translation Accuracy

[figure: translation quality vs. decoding speed for Algorithms 2 and 3]

SLIDE 90

Conclusion so far

  • General framework of DP on hypergraphs
  • monotonicity => exact 1-best algorithm
  • Exact k-best algorithms
  • Approximate search with non-local information
  • Forest Reranking for discriminative parsing
  • Forest Rescoring for MT decoding
  • Empirical Results
  • orders of magnitude faster than previous methods
  • best Treebank parsing accuracy to date

SLIDE 91

Impact

  • These algorithms have been widely implemented in
  • state-of-the-art parsers
  • Charniak parser
  • McDonald's dependency parser
  • MIT parser (Collins/Koo), Berkeley and Stanford parsers
  • DOP parsers (Bod, 2006/7)
  • major statistical MT systems
  • syntax-based systems from ISI, CMU, BBN, ...
  • phrase-based system: Moses [underway]

SLIDE 92

Future Directions

SLIDE 93

Further work on Forest Reranking

  • Better Decoding Algorithms
  • pre-compute most non-local features
  • use Algorithm 3 (cube growing)
  • intra-sentence level parallelized decoding
  • Combination with Semi-supervised Learning
  • easy to apply to self-training (McClosky et al., 2006)
  • Deeper and deeper Decoding (e.g., semantic roles)
  • Other Machine Learning Algorithms
  • Theoretical and Empirical Analysis of Search Errors

SLIDE 94

Machine Translation / Generation

  • Discriminative training using non-local features
  • local features showed modest improvement in phrase-based systems (Liang et al., 2006)
  • plan for syntax-based (tree-to-string) systems
  • fast, linear-time decoding
  • Using the packed parse forest for
  • tree-to-string decoding (Mi, Huang, Liu, 2008)
  • rule extraction (tree-to-tree)
  • Generation / Summarization: non-local constraints

SLIDE 95

THE END - Thanks!

Questions? Comments?

SLIDE 96

Speed vs. Search Quality

[figure: search quality (-log Prob) vs. speed, tested on our faithful clone of Pharaoh with the same parameters; 32 times faster]

SLIDE 99

Syntax-based: Search Quality

[figure: search quality (-log Prob) vs. speed; 10 times faster]

SLIDE 100

Tree-to-String System

  • syntax-directed, English to Chinese (Huang, Knight, Joshi, 2006)
  • first parse the input, and then recursively transfer
  • synchronous tree-substitution grammars (STSG) (Galley et al., 2004; Eisner, 2003)
  • extended to translate a packed forest instead of a tree (Mi, Huang, Liu, 2008)

[figure: English parse tree of "was shot to death by the police" and its Chinese translation, using the passive marker "bei"]

SLIDE 102

Features

  • extract features on the 50-best parses of the training set
  • cut off low-frequency features with count < 5
  • counts are "relative" -- must change on at least 5 sentences
  • feature templates
  • 4 local from (Charniak and Johnson, 2005)
  • 4 local from (Collins, 2000)
  • 7 non-local from (Charniak and Johnson, 2005)
  • 800,582 feature instances (30% non-local)
  • cf. C&J: 1.3M feature instances (60% non-local)

SLIDE 103

Forest Oracle

the candidate tree that is closest to the gold standard

SLIDE 104

Optimal Parseval F-score

  • the Parseval F1-score is the harmonic mean of labeled precision and labeled recall
  • we cannot optimize F-scores on sub-forests separately
  • we instead use dynamic programming
  • optimize the number of matched brackets per given number of test brackets
  • "when the test (sub-)parse has 5 brackets, what is the max number of matched brackets?"

SLIDE 105

Combining Oracle Functions

  • combining two oracle functions along a hyperedge e = <(v, u), w> needs a convolution operator ⊗

    t : f(t)        t : g(t)          t : (f⊗g)(t)
    2 : 1      ⊗    4 : 4       =     6 : 5
    3 : 2           5 : 4             7 : 6
                                      8 : 6

then shift for the bracket at node w itself -- is this node matched?

    N:  t : (f⊗g)⇑(1,0)(t)      Y:  t : (f⊗g)⇑(1,1)(t)
        7 : 5                       7 : 6
        8 : 6                       8 : 7
        9 : 6                       9 : 7

final answer: ora[w]
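A sketch of ⊗ and the ⇑ shift with oracle functions as Python dicts (illustrative; it reproduces the tables above):

```python
def convolve(f, g):
    """(f ⊗ g)(t) = max over t1 + t2 = t of f(t1) + g(t2).

    f, g: dicts mapping number of test brackets -> max matched brackets."""
    out = {}
    for t1, m1 in f.items():
        for t2, m2 in g.items():
            t = t1 + t2
            out[t] = max(out.get(t, 0), m1 + m2)
    return out

def shift(h, dt, dm):
    """h⇑(dt, dm): account for the bracket at the head node itself
    (dt = 1 test bracket; dm = 1 if it matches the gold tree, else 0)."""
    return {t + dt: m + dm for t, m in h.items()}

f = {2: 1, 3: 2}
g = {4: 4, 5: 4}
fg = convolve(f, g)          # {6: 5, 7: 6, 8: 6}
print(shift(fg, 1, 0))       # node not matched: {7: 5, 8: 6, 9: 6}
print(shift(fg, 1, 1))       # node matched:     {7: 6, 8: 7, 9: 7}
```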

SLIDE 108

Forest Pruning

a variant of the Inside-Outside algorithm

SLIDE 109

Pruning (J. Graehl, unpublished)

  • prune by marginal probability (Charniak and Johnson, 2005)
  • but we prune hyperedges as well as nodes
  • compute the Viterbi inside cost β(v) and outside cost α(v)
  • compute the merit αβ(e) = α(head(e)) + Σ u∈tails(e) β(u)
  • the cost of the best derivation that traverses e
  • prune away hyperedges with αβ(e) - β(TOP) > p
  • difference: a node can "partially" survive the beam
  • prunes on average 15% more hyperedges than C&J
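A sketch of this pruning pass (illustrative Python; costs are negative log probabilities, and the hyperedge's own cost is counted once in the merit, which the slide's compact formula folds in):

```python
def prune(nodes, incoming, top, p):
    """Keep hyperedges whose merit is within p of the best derivation.

    nodes: topological order (tails before heads);
    incoming: node -> [(tails, cost), ...]; leaves cost 0.
    Returns surviving hyperedges as (head, tails, cost) triples."""
    INF = float("inf")
    beta = {}                                    # Viterbi inside costs
    for v in nodes:
        edges = incoming.get(v, [])
        beta[v] = 0.0 if not edges else min(
            c + sum(beta[u] for u in tails) for tails, c in edges)
    alpha = {v: INF for v in nodes}              # Viterbi outside costs
    alpha[top] = 0.0
    for v in reversed(nodes):                    # heads before their tails
        for tails, c in incoming.get(v, []):
            for i, u in enumerate(tails):
                rest = c + sum(beta[w] for j, w in enumerate(tails) if j != i)
                alpha[u] = min(alpha[u], alpha[v] + rest)
    kept = []
    for v in nodes:
        for tails, c in incoming.get(v, []):
            merit = alpha[v] + c + sum(beta[u] for u in tails)
            if merit - beta[top] <= p:           # survives the beam
                kept.append((v, tails, c))
    return kept

# toy: two ways to build S; the costlier hyperedge falls outside the beam
print(prune(["NP", "VP", "S"],
            {"S": [(("NP", "VP"), 1.0), (("NP", "VP"), 1.5)]}, "S", 0.3))
```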