Dynamic Feature Selection for Dependency Parsing
He He, Hal Daumé III and Jason Eisner
EMNLP 2013, Seattle
2
Structured Prediction in NLP
Part-of-Speech Tagging: Fruit flies like a banana → N N V Det N
Machine Translation: Fruit flies like a banana . → 果蝇喜欢香蕉。 (Chinese: "Fruit flies like bananas.")
Parsing: [dependency tree over "$ Fruit flies like a banana"]
... and summarization, named entity resolution, and many more
Exponentially increasing search space; millions of features for scoring
4
Structured Prediction in NLP
[Figure: lattice over "$ Fruit flies like a banana" with candidate tags (N, V, D) at every position and candidate edges between positions]
Feature templates per edge: token left, token right, token in-between, token stem form, bigram tag, coarse tag, length, direction, ...
(head_token + mod_token) × (head_tag + mod_tag) — HUGE
Do you need all features everywhere?
Dynamic Decisions
12
Case Study: Dependency Parsing
[Bar chart: speedup of our parser over the baseline (1x–6x) for Bulgarian, Chinese, English, German, Japanese, Portuguese, Swedish]
2x to 6x speedup with little loss in accuracy
13
Graph-based Dependency Parsing
[Figure: all candidate head–modifier edges over "$ This time , the firms were ready ."]
Scoring: each candidate edge (h → m) is scored by a weighted sum of its features; a tree's score is the sum of its edge scores.
14
Graph-based Dependency Parsing
[Figure: candidate edges over "$ This time , the firms were ready .", highlighting the edge firms → were]
Scoring the example edge firms → were:
  length: 1
  direction: right
  modifier_token: were
  head_token: firms
  head_tag: noun
  ⋮
And hundreds more!
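To make the template idea concrete, here is a minimal runnable sketch (my own illustration — the template names and weights are invented, not MSTParser's actual ones) of instantiating and scoring a few templates for one candidate edge:

def edge_features(sent, tags, head, mod):
    # Feature strings for the candidate edge head -> mod.
    # sent/tags are parallel lists; index 0 is the artificial root '$'.
    return [
        f"head_token={sent[head]}",
        f"mod_token={sent[mod]}",
        f"head_tag={tags[head]}",
        f"mod_tag={tags[mod]}",
        f"tag_pair={tags[head]}_{tags[mod]}",
        f"length={abs(head - mod)}",
        f"direction={'right' if head < mod else 'left'}",
        # a real parser adds hundreds more: in-between tokens, coarse tags, ...
    ]

def score_edge(weights, feats):
    # Linear model: sum the learned weight of every feature that fires.
    return sum(weights.get(f, 0.0) for f in feats)

sent = ["$", "This", "time", ",", "the", "firms", "were", "ready", "."]
tags = ["ROOT", "DT", "NN", ",", "DT", "NNS", "VBD", "JJ", "."]
feats = edge_features(sent, tags, head=5, mod=6)      # the edge firms -> were
print(score_edge({"length=1": 0.3}, feats))           # toy weights -> 0.3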
15
Graph-based Dependency Parsing
[Figure: candidate edges over "$ This time , the firms were ready ."]
Decoding: find the highest-scoring tree
[Figure: the winning dependency tree over the same sentence]
16
MST Dependency Parsing (1st-order projective)
[Figure: total parse time per sentence, split into finding edge scores vs. finding the highest-scoring tree, for sentences of 10–60 words]
Find the highest-scoring tree: O(n³)
Find edge scores: ~268 feature templates, ~76M features — this is where most of the time goes
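To see why edge scoring dominates, a rough back-of-the-envelope count (my own illustration, not a figure from the talk) helps: an n-word sentence has about n² candidate head–modifier edges and each edge touches ~268 templates, so scoring costs roughly n² × 268 template instantiations — for a 40-word sentence about 1,600 × 268 ≈ 430,000 — whereas the O(n³) decoder performs about 64,000 much cheaper dynamic-programming steps.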
19
Add features only when necessary!
Example: two candidate edges, This → ready and the → firms.
- After the first feature group: score(This → ready) = −0.23, score(the → firms) = +0.63
- Add another group: score(This → ready) = −0.23 + 0.1 = −0.13; score(the → firms) = 0.63 + 0.7 = 1.33 → looks like a WINNER
- Add another group: score(This → ready) = −0.13 − 1.2 = −1.33; the → firms stays at 1.33 (WINNER)
- Add another group: score(This → ready) = −1.33 − 0.55 = −1.88 → a LOSER
But this is a structured problem! We should not look at edge scores independently.
26
Dynamic Dependency Parsing (a runnable toy of this loop follows below)
1. Find the highest-scoring tree after adding some features (fast non-projective decoding)
2. Only edges in the current best tree can win
   – winners are chosen by a classifier (≤ n decisions)
   – losers are killed because they fight with the winners
3. Add features to the undetermined edges, by group
Max # of iterations = # of feature groups
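The toy below is my own simplification and is runnable as-is: the "decoder" just picks each word's best surviving head (standing in for real non-projective MST decoding), the "classifier" is a fixed score-margin test (standing in for the trained classifier), and feature-group contributions are random numbers — only the control flow matches the algorithm above.

import random

def toy_dynamic_parse(words, n_groups=4, margin=0.5, seed=0):
    rng = random.Random(seed)
    n = len(words)                                    # index 0 is the root '$'
    edges = [(h, m) for m in range(1, n) for h in range(n) if h != m]
    scores = {e: 0.0 for e in edges}                  # partial edge scores
    winners, losers, gray = set(), set(), set(edges)  # gray = undetermined
    tree = {}
    for k in range(n_groups):
        for e in gray:                                # add the next feature group
            scores[e] += rng.uniform(-1, 1)           # toy feature contributions
        # stand-in for non-projective decoding: best non-loser head per word
        tree = {m: max(((h, m) for h in range(n) if h != m and (h, m) not in losers),
                       key=lambda e: scores[e])
                for m in range(1, n)}
        # stand-in for the classifier: lock an edge whose margin is big enough
        for m, best in tree.items():
            rivals = [scores[(h, m)] for h in range(n)
                      if h != m and (h, m) != best and (h, m) not in losers]
            if best in gray and (not rivals or scores[best] - max(rivals) > margin):
                winners.add(best)
                # kill edges that fight with the winner (same child here)
                losers |= {(h, m) for h in range(n) if h != m and (h, m) != best}
        gray -= winners | losers
        print(f"after feature group {k + 1}: {len(gray)} gray edges left")
        if not gray:
            break
    return tree

print(toy_dynamic_parse(["$", "the", "firms", "were", "ready"]))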
31
Worked example on "$ This time , the firms were ready ."
Edge legend: current 1-best tree; winner edge; loser edge; undetermined (gray) edge. Winners are permanently in the 1-best tree.
Each pass: add a feature group to the gray edges, run non-projective decoding to find a new 1-best tree, let the classifier pick winners among the blue edges of that tree, then remove losers in conflict with the winners.
- + first feature group (5 features per gray edge): 51 gray edges → decode → pick winners (50 gray left) → remove conflicting losers (44 gray left)
- + next feature group (27 features per gray edge): 44 gray → decode → pick winners (42 gray) → remove losers (31 gray)
- + next feature group (74 features per gray edge): 31 gray → decode → pick winners (28 gray) → remove losers (8 gray)
- + next feature group (107 features per gray edge): 8 gray → decode → pick winners (7 gray) → remove losers (3 gray)
- + last feature group (268 features per gray edge): 3 gray edges remain → projective decoding finds the final 1-best tree
53
What Happens During the Average Parse?
[Figure: when edges become winners or losers, and the runtime, across the feature-group stages]
Most edges win or lose early
Some edges win late
Later features are helpful
Linear increase in runtime
58
Summary: How Early Decisions Are Made
- Winners
  – Will definitely appear in the 1-best tree
- Losers (a small sketch of these tests follows below)
  – Have the same child as a winning edge
  – Form a cycle with winning edges
  – Cross a winning edge (optional)
  – Share the root ($) with a winning edge (optional)
- Undetermined
  – Add the next feature group to the remaining gray edges
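Here is a small runnable sketch of those tests (my own code; edges are (head, modifier) index pairs with 0 as the artificial root '$', and `winners` is the set of edges already locked into the 1-best tree):

def is_loser(edge, winners, prune_crossing=True, prune_shared_root=True):
    head, mod = edge
    locked_head = {m: h for h, m in winners}      # child -> locked head
    # 1. same child as a winning edge
    if mod in locked_head:
        return True
    # 2. forms a cycle with winning edges: walking up locked heads from `head`
    #    must not lead back to `mod`
    node = head
    while node in locked_head:
        node = locked_head[node]
        if node == mod:
            return True
    # 3. (optional) crosses a winning edge
    if prune_crossing:
        a, b = sorted(edge)
        for w in winners:
            c, d = sorted(w)
            if a < c < b < d or c < a < d < b:
                return True
    # 4. (optional) shares the root '$' with a winning edge
    if prune_shared_root and head == 0 and any(h == 0 for h, _ in winners):
        return True
    return False

winners = {(0, 6), (6, 5)}          # e.g. $ -> were, were -> firms
print(is_loser((4, 5), winners))    # the -> firms: True (same child as a winner)
print(is_loser((0, 2), winners))    # $ -> time: True (shares root, optional rule)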
59
Feature Template Ranking
- Forward selection (greedy; a toy sketch follows after this list)
  – Score each template alone: A 0.60, B 0.49, C 0.55 → pick A (rank 1)
  – Try pairs with A: A&B 0.80, A&C 0.85 → pick C (rank 2)
  – Try A&C&B: 0.90 → pick B (rank 3)
- Grouping: partition the ranked template list into groups; the "+ ~0.1" marks indicate that each group adds roughly 0.1 accuracy
  head cPOS + mod cPOS + in-between punct #   0.49
  in-between cPOS                             0.59
  head POS + mod POS + in-between conj #      0.71
  head POS + mod POS + in-between POS + dist  0.72
  head token + mod cPOS + dist                0.80
  ⋮
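Since the ranking procedure is the standard greedy loop, a toy sketch may help (my own illustration: `dev_accuracy` just looks numbers up in a table, standing in for "retrain the parser with these templates and measure held-out accuracy"; the grouping rule reflects my reading of the "+ ~0.1" marks):

ACC = {frozenset("A"): 0.60, frozenset("B"): 0.49, frozenset("C"): 0.55,
       frozenset("AB"): 0.80, frozenset("AC"): 0.85, frozenset("BC"): 0.70,
       frozenset("ABC"): 0.90}

def dev_accuracy(templates):
    return ACC[frozenset(templates)]

def forward_select(all_templates):
    chosen, ranking = set(), []
    while len(chosen) < len(all_templates):
        best = max((t for t in all_templates if t not in chosen),
                   key=lambda t: dev_accuracy(chosen | {t}))
        chosen.add(best)
        ranking.append((best, dev_accuracy(chosen)))
    return ranking

def group_templates(ranking, gain=0.1):
    # close a group once cumulative accuracy has risen by ~`gain`
    groups, start_acc, current = [], 0.0, []
    for tmpl, acc in ranking:
        current.append(tmpl)
        if acc - start_acc >= gain:
            groups.append(current)
            current, start_acc = [], acc
    if current:
        groups.append(current)
    return groups

ranking = forward_select({"A", "B", "C"})
print(ranking)                    # [('A', 0.6), ('C', 0.85), ('B', 0.9)] as on the slide
print(group_templates(ranking))   # [['A'], ['C'], ['B']]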
67
Partition Feature List Into Groups
68
How to pick the winners?
- Learn a classifier
- Features
  – Currently added parsing features
  – Meta-features: confidence of a prediction
- Training examples
  – Input: each blue edge in the current 1-best tree
  – Output: is the edge in the gold tree? If so, we want it to win!
72
Classifier Features
- Currently added parsing features
- Meta-features (these are dynamic features — they change as parsing proceeds)
  – The edge's scores after each stage so far, e.g. …, 0.5, 0.8, 0.85 (scores are normalized by the sigmoid function)
  – Margin to the highest-scoring competing edge
  – Index of the next feature group
[Figure: the example edge the → firms among competing candidates in "$ This time the firms were .", with sigmoid-normalized scores 0.72, 0.65, 0.30, 0.23, 0.12]
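A small runnable sketch of these meta-features (my own illustration; the exact feature names and the use of the sigmoid for the margin are guesses consistent with the slide):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def meta_features(edge_score_history, competitor_scores, next_group_index):
    """edge_score_history: raw scores of this edge after each stage so far;
    competitor_scores: raw scores of the edges competing for the same child."""
    normalized = [sigmoid(s) for s in edge_score_history]   # e.g. 0.5, 0.8, 0.85
    margin = sigmoid(edge_score_history[-1]) - max(map(sigmoid, competitor_scores))
    return {"score_history": normalized,
            "margin_to_best_competitor": margin,
            "next_feature_group": next_group_index}

print(meta_features([0.0, 1.4, 1.7], [-0.9, -2.0, 0.2], next_group_index=3))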
75
How To Train With Dynamic Features
- Training examples are not fixed in advance!
- Winners/losers from stages < k affect:
  – the set of edges to classify at stage k
  – the dynamic features of those edges at stage k
- Bad decisions can cause future errors
Reinforcement / Imitation Learning
- Dataset Aggregation (DAgger) (Ross et al., 2011) — a sketch of the loop follows below
  – Iterates between training and running a model
  – Learns to recover from past mistakes
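A minimal sketch of the DAgger loop in this setting (my own code; the two stub functions stand in for running the staged parser under the current policy and for fitting, e.g., a LIBLINEAR model — only the aggregate-and-retrain structure is the point):

def run_parser_and_collect(sentence, policy):
    # STUB: should run dynamic parsing with `policy`, yielding for every
    # decision point its feature vector and the oracle label (edge in gold tree?).
    return [((0.5, 0.1), 1), ((0.2, 0.9), 0)]

def train_classifier(dataset):
    # STUB: should fit a real classifier; here, a fixed threshold on feature 0.
    return lambda feats: 1 if feats[0] > 0.4 else 0

def dagger(sentences, n_iters=3):
    dataset, policy = [], (lambda feats: 1)          # initial policy
    for _ in range(n_iters):
        for sent in sentences:
            # states come from running OUR current policy,
            # labels come from the oracle (the gold tree)
            dataset += run_parser_and_collect(sent, policy)
        policy = train_classifier(dataset)           # retrain on all data so far
    return policy

policy = dagger(["This time , the firms were ready ."])
print(policy((0.5, 0.1)))   # -> 1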
77
Upper Bound of Our Performance
- “Labels”
  – Gold edges always win
  – 96.47% UAS while using only 2.9% of the first-order features
[Figure: the pruned edge set vs. the final tree for "$ This time , the firms were ready ."]
78
How To Train Our Parser
1. Train parsers (non-projective, projective) using all features
2. Rank and group feature templates
3. Iteratively train a classifier to decide winners/losers
79
Experiment
- Data
  – Penn Treebank: English
  – CoNLL-X: Bulgarian, Chinese, German, Japanese, Portuguese, Swedish
- Parser
  – MSTParser (McDonald et al., 2006)
- Dynamically-trained classifier
  – LIBLINEAR (Fan et al., 2008)
80
Dynamic Feature Selection Beats Static Forward Selection
81
Dynamic Feature Selection Beats Static Forward Selection
[Figure: accuracy vs. feature cost — dynamic selection ("add features as needed") vs. static forward selection ("always add the next feature group to all edges")]
82
Experiment: 1st-order — 2x to 6x speedup
[Bar chart: speedup of DynFS over the baseline (1x–6x) for Bulgarian, Chinese, English, German, Japanese, Portuguese, Swedish]
83
Experiment: 1st-order — ~0.2% loss in accuracy
[Bar chart: relative accuracy of DynFS vs. the baseline (y-axis 99.3%–100.3%) for the seven languages]
relative accuracy = accuracy of the pruning parser / accuracy of the full parser
84
Second-order Dependency Parsing
[Figure: a second-order part — the head "were" with modifier "ready" and its sibling]
- Features depend on the siblings as well
- First-order: O(n²) substructures to score
- Second-order: O(n³) substructures to score — ~380 feature templates, ~96M features
- Decoding: still O(n³)
[Figure: the dependency tree over "$ This time , the firms were ready ."]
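For concreteness, a second-order feature template also conditions on a sibling of the modifier; a tiny sketch (my own illustrative template names, not the parser's real ones):

def sibling_features(sent, tags, head, mod, sib=None):
    # Features for the second-order part (head, modifier, adjacent sibling);
    # sib is None when the modifier is the first child on its side.
    s_tok = sent[sib] if sib is not None else "NULL"
    s_tag = tags[sib] if sib is not None else "NULL"
    return [f"htag={tags[head]}|mtag={tags[mod]}|stag={s_tag}",
            f"htok={sent[head]}|mtok={sent[mod]}|stok={s_tok}"]

sent = ["$", "This", "time", ",", "the", "firms", "were", "ready", "."]
tags = ["ROOT", "DT", "NN", ",", "DT", "NNS", "VBD", "JJ", "."]
print(sibling_features(sent, tags, head=6, mod=7))   # were -> ready, no sibling yet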
85
Experiment: 2nd-order — 2x to 8x speedup
[Bar chart: speedup of DynFS over the baseline (1x–9x) for the seven languages]
86
Experiment: 2nd-order — ~0.3% loss in accuracy
[Bar chart: relative accuracy of DynFS vs. the baseline (99.3%–100.3%) for the seven languages]
87
Ours vs Vine Pruning (Rush and Petrov, 2012)
- Vine pruning: a very fast parser that speeds up using orthogonal techniques
  – Start with short edges (fully scored)
  – Add long edges in if needed
- Ours
  – Start with all edges (partially scored)
  – Quickly remove unneeded edges
- Could be combined for further speedup!
88
VS Vine Pruning: 1st-order — comparable performance
[Bar chart: speedup of DynFS and VineP over the baseline (1x–6x) for the seven languages]
89
VS Vine Pruning: 1st-order
[Bar chart: relative accuracy of DynFS and VineP vs. the baseline (99.3%–100.3%) for the seven languages]
90
VS Vine Pruning: 2nd-order
[Bar chart: speedup of DynFS and VineP over the baseline (2x–16x) for the seven languages]
91
VS Vine Pruning: 2nd-order
[Bar chart: relative accuracy of DynFS and VineP vs. the baseline (99.3%–100.3%) for the seven languages]
92
Conclusion
- Feature computation is expensive in structured prediction
- Commitments should be made dynamically
- Early commitment to edges reduces both search and scoring time
- The approach can be used in other feature-rich models for structured prediction
93
Backup Slides
94
Static dictionary pruning (Rush and Petrov, 2012)
[Figure: dictionary counts used for static pruning, e.g. VB CD: → 18, VB CD: ← 3, NN VBG: → 22, NN VBG: ← 11, ..., shown over "$ This time , the firms were ready ."]
95
Reinforcement Learning 101
- Markov Decision Process (MDP)
  – State: all the information helping us to make decisions
  – Action: things we choose to do
  – Reward: criteria for evaluating actions
  – Policy: the “brain” that makes the decision
- Goal
  – Maximize the expected future reward
96
Policy Learning
π(edge + context) = add / lock   (e.g. for the candidate edge the → firms)
- Markov Decision Process (MDP)
  – reward = accuracy + λ∙speed
- Reinforcement learning
  – Delayed reward
  – Long time to converge
- Imitation learning
  – Mimic the oracle
  – Reduced to a supervised classification problem
97
Imitation Learning
- Oracle
  – (near) optimal performance
  – generates the target action in any given state
π([one candidate edge] + context) = lock
π([another candidate edge] + context) = add
...
Binary classifier
98
Dataset Aggregation (DAgger)
- Collecting data from the oracle only
  – gives a different state distribution at training and test time
- Iterative policy training
  – corrects the learner's mistakes
  – obtains a policy that performs well under its own state distribution
99
Experiment (1st-order)
[Bar chart: feature cost of DynFS (0%–45%) for the seven languages]
cost = (# feature templates used) / (total # feature templates on the statically pruned graph)
100
Experiment (2nd-order)
[Bar chart: feature cost of DynFS (0%–80%) for the seven languages]
101
Second-order Parsing
[Figure spanning two slides: a second-order parsing example]