On the internals of disco-dop: How to implement a state-of-the-art LCFRS parser

Kilian Gebhardt
Grundlagen der Programmierung, Fakultät Informatik, TU Dresden
November 16, 2018
Motivation

◮ LCFRS parsing is hard: O(n^{m·k}), where n, m, and k are the sentence length, the maximum number of nonterminals in a rule, and the fanout of the grammar, respectively.
◮ Exact inference with real-world LCFRS might be feasible up to sentence length 30 (see Angelov and Ljunglöf 2014).
◮ We want to parse longer sentences, and short sentences faster!
disco-dop

◮ Parsing framework developed by Andreas van Cranenburgh (cf. Cranenburgh, Scha, and Bod 2016).
◮ Uses a discontinuous data-oriented model (a discontinuous tree-substitution grammar) at its core.
◮ Employs a coarse-to-fine pipeline for parsing:
  1. PCFG stage
  2. LCFRS stage
  3. DOP stage
The coarse-to-fine pipeline (grammars)

◮ The DOP model is equivalent to marginalizing over a latently annotated LCFRS (the fine LCFRS) (see Goodman 2003 for the continuous case).
◮ The original treebank t1 is binarized/Markovized (yielding t2) and a coarse probabilistic LCFRS is induced. (The grammar is binarized, simple, and ordered, and may contain chain rules.)
◮ Discontinuity in t2 is resolved by splitting categories. After binarizing again, we obtain t3 and induce a PCFG. (The grammar is binarized and simple, and may contain chain rules.)
◮ Some preprocessing is applied to lexical rules to handle unknown words (Stanford signatures¹).

¹ See unknownword6 and unknownword4 in https://github.com/andreasvc/disco-dop/blob/master/discodop/lexicon.py
The coarse-to-fine pipeline (application)

◮ Parse with stage s, resulting in a chart.
◮ If successful, obtain a whitelist of items from the chart:
  ◮ k = 0: select all items that are part of a successful derivation
  ◮ 0 < k < 1: select each item i where α(i) · β(i) ≥ k
  ◮ k ≥ 1: select all items that occur in the k best derivations
  (For PCFG → PLCFRS, k = 10,000 is the default.)
◮ The next stage s + 1 prunes an item i if coarsify(i) is not in the whitelist.
◮ If unsuccessful, stop parsing and greedily/recursively select the largest possible items from the chart as a fallback strategy.
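The three whitelist modes can be sketched as follows. This is a minimal, self-contained sketch: the input format (an item mapped to its posterior α(i) · β(i) and the ranks of the k-best derivations it occurs in) is illustrative, not disco-dop's actual chart interface.

```python
def build_whitelist(items, k):
    """Select chart items to keep for the next (finer) stage.

    `items` maps each item to (posterior, kbest_ranks), where posterior
    approximates alpha(i) * beta(i) and kbest_ranks lists the ranks of
    the best derivations the item occurs in (empty = in no successful
    derivation).  Illustrative input format, not disco-dop's API.
    """
    if k == 0:
        # every item occurring in at least one successful derivation
        return {i for i, (post, ranks) in items.items() if ranks}
    elif 0 < k < 1:
        # posterior threshold: keep item i if alpha(i) * beta(i) >= k
        return {i for i, (post, ranks) in items.items() if post >= k}
    else:
        # all items occurring in the k best derivations
        return {i for i, (post, ranks) in items.items()
                if any(r < k for r in ranks)}

items = {
    "NP:0-2": (0.9, [0, 1]),   # in the 1st and 2nd best derivation
    "VP:2-5": (0.4, [1]),      # only in the 2nd best derivation
    "PP:3-5": (0.05, []),      # in no successful derivation
}
assert build_whitelist(items, 0) == {"NP:0-2", "VP:2-5"}
assert build_whitelist(items, 0.5) == {"NP:0-2"}
assert build_whitelist(items, 1) == {"NP:0-2"}
```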
Representation of LCFRS rules I

1. A → ⟨x_1^(1) x_1^(2) x_2^(1), x_2^(2) x_3^(1) x_4^(1)⟩(B, C)

The composition function is encoded in two bit vectors: for each variable x_j^(i) (read left to right), the corresponding bit of args is i − 1 (i.e., 0 if the variable comes from the first successor, 1 if from the second), and the corresponding bit of lengths is 1 if the variable ends a component.

    struct ProbRule {     // total: 32 bytes
        double   prob;    // 8 bytes
        uint32_t lhs;     // 4 bytes
        uint32_t rhs1;    // 4 bytes
        uint32_t rhs2;    // 4 bytes
        uint32_t args;    // 4 bytes => max. 32 variables per rule
        uint32_t lengths; // 4 bytes => same
        uint32_t no;      // 4 bytes
    };

For the rule above: args = 0b001010 and lengths = 0b100100.
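The bit encoding can be decoded back into the composition function. A minimal sketch (the helper name and the list-of-lists representation are illustrative, not disco-dop's internals):

```python
def decode_yield(args, lengths, nvars):
    """Decode args/lengths bit vectors into components of successor indices.

    Bit j of `args` says which successor variable j comes from
    (0 = rhs1, 1 = rhs2); bit j of `lengths` marks the end of a component.
    Returns one list per component of the yield function.
    """
    components = [[]]
    for j in range(nvars):
        components[-1].append((args >> j) & 1)
        if (lengths >> j) & 1 and j < nvars - 1:
            components.append([])  # component boundary: start a new one
    return components

# the example rule A -> <x1(1) x1(2) x2(1), x2(2) x3(1) x4(1)>(B, C):
assert decode_yield(0b001010, 0b100100, 6) == [[0, 1, 0], [1, 0, 0]]
```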
Representation of LCFRS rules II

2. A → ⟨x_1^(1), x_2^(1) x_3^(1)⟩(B)
   (same representation, with rhs2 = 0)

3. A → α
   stored via a map Σ → vector<uint32_t> and a vector<LexicalRule>, where:

    struct LexicalRule {
        double   prob;
        uint32_t lhs;
    };
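The map-plus-vector layout for lexical rules can be sketched as follows (a minimal Python sketch; all names and the example grammar entries are illustrative, not disco-dop's actual API):

```python
# mirrors vector<LexicalRule>: a flat list of (prob, lhs) pairs
lexical_rules = [
    (0.7, 1),  # e.g. NN  -> "bank"
    (0.3, 2),  # e.g. VB  -> "bank"
    (1.0, 3),  # e.g. DET -> "the"
]

# mirrors the map Sigma -> vector<uint32_t>: terminal -> rule indices
lexicon = {"bank": [0, 1], "the": [2]}

def lexical_candidates(word):
    """Return all (prob, lhs) pairs whose lexical rule rewrites to `word`."""
    return [lexical_rules[i] for i in lexicon.get(word, [])]

assert lexical_candidates("bank") == [(0.7, 1), (0.3, 2)]
assert lexical_candidates("unseen-word") == []
```

The indirection via indices keeps the per-word lists small while the rule payloads stay in one contiguous vector.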
PCFG parsing I

Bottom-up chart parsing (based on Bodenstab 2009's fast grammar loop):

    populate_pos(chart, grammar, sentence)

    for span in range(2, n + 1):
        for left in range(1, n + 1 - span):
            right = left + span
            for lhs in grammar.nonts:
                for rule in grammar.rules[lhs]:
                    for mid in range(left + 1, right):
                        p1 = chart.getprob(left, mid, rule.rhs1)
                        p2 = chart.getprob(mid, right, rule.rhs2)
                        p_new = rule.prob + p1 + p2
                        if chart.updateprob(left, right, p_new):
                            chart.add_edge( ... )
            applyunary(left, right, chart, grammar)
PCFG parsing II

Beam search (based on Zhang et al. 2010):
◮ local beam search by beam thresholding with parameters η = 10⁻⁴, δ = 40
◮ If span ≤ δ and p_new < η · p_best4cell, then prune.
◮ Only applied to binary rules.

Chart data structures:
◮ Items are densely enumerated (cellidx(start, stop, nonterminal)).
◮ Log-probabilities are saved in a vector (indexed by cellidx).
◮ Incoming edges are saved for each item (chart.parseforest).
◮ The best derivation (or the k best derivations) is retrieved afterwards by recursively selecting the best edge.
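Dense enumeration of chart items can be sketched as follows. This is an illustrative indexing scheme under the stated assumptions; disco-dop's actual cellidx layout may differ.

```python
def make_cellidx(n, num_nonterminals):
    """Build a dense index for items [A, start, stop] over a sentence of length n.

    Every (start, stop, nonterminal) triple maps to a unique slot in a
    flat vector, so probabilities can live in a plain array instead of
    a hash map.
    """
    def cellidx(start, stop, nonterminal):
        assert 0 <= start < stop <= n and 0 <= nonterminal < num_nonterminals
        cell = start * (n + 1) + stop          # unique cell per span
        return cell * num_nonterminals + nonterminal
    return cellidx

cellidx = make_cellidx(n=5, num_nonterminals=10)
probs = [float("inf")] * (5 * 6 * 10)  # log-prob vector; inf = item unseen
probs[cellidx(0, 3, 7)] = 2.5          # store a Viterbi log-probability
```

Array indexing trades a little memory (slots for items that never occur) for O(1) lookups without hashing, which pays off in the inner grammar loop.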
PCFG parsing III

Mid filter = auxiliary data structure (size: 4 · |N| · n) with entries

    minleft(A, j)  = min{ i | [A, i, j] ∈ chart }
    maxleft(A, j)  = max{ i | [A, i, j] ∈ chart }
    minright(A, i) = min{ j | [A, i, j] ∈ chart }
    maxright(A, i) = max{ j | [A, i, j] ∈ chart }

Replace "for mid in range(left + 1, right)" by

    for mid in range(max(minright(B, left), minleft(C, right)),
                     min(maxright(B, left), maxleft(C, right)))
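Maintaining the four tables and using them to restrict split points can be sketched as follows (a Bodenstab-style cell-constraint sketch with assumed names, not disco-dop's actual layout; here the upper bound is made inclusive explicitly):

```python
from collections import defaultdict

# the four filter tables, keyed by (nonterminal, boundary)
minleft = defaultdict(lambda: float("inf"))
maxleft = defaultdict(lambda: -1)
minright = defaultdict(lambda: float("inf"))
maxright = defaultdict(lambda: -1)

def record_item(nont, i, j):
    """Update the filter whenever item [nont, i, j] enters the chart."""
    minleft[nont, j] = min(minleft[nont, j], i)
    maxleft[nont, j] = max(maxleft[nont, j], i)
    minright[nont, i] = min(minright[nont, i], j)
    maxright[nont, i] = max(maxright[nont, i], j)

def mid_candidates(B, C, left, right):
    """Split points consistent with the spans observed for B and C.

    mid is simultaneously a right boundary of some [B, left, mid] and a
    left boundary of some [C, mid, right], so it must lie in both ranges.
    """
    lo = max(minright[B, left], minleft[C, right])
    hi = min(maxright[B, left], maxleft[C, right])
    return range(int(lo), int(hi) + 1)

record_item("B", 0, 2)
record_item("C", 2, 5)
assert list(mid_candidates("B", "C", 0, 5)) == [2]
```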
LCFRS parsing

Agenda-driven LCFRS parser (with filter):

    populate_pos(...)

    while not agenda.empty():
        item, prob = agenda.pop()
        chart.updateprob(item, prob)

        if item == goal and not exhaustive:
            break

        applyunaryrules(item, grammar, chart, agenda)
        for rule in lbinary[item.nont]:
            for item2 in chart.items[rule.rhs2]:
                process(rule, item, item2, chart, agenda, whitelist)
        for rule in rbinary[item.nont]:
            for item2 in chart.items[rule.rhs1]:
                process(rule, item2, item, chart, agenda, whitelist)
LCFRS parsing (heuristics)

◮ Outside estimates SX, SXlrgaps, etc. (Klein and Manning 2003; Kallmeyer and Maier 2013)
◮ score += length * MAX_LOGPROB, i.e., smaller items are processed before larger items.
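The length-based priority can be sketched as follows (the constant and function name are illustrative; MAX_LOGPROB stands for an assumed upper bound on any item's negative log-probability):

```python
MAX_LOGPROB = 300.0  # assumed upper bound on any -log probability

def priority(neg_logprob, item_length):
    """Agenda priority: smaller items first, ties broken by Viterbi score.

    Adding item_length * MAX_LOGPROB guarantees that an item covering
    fewer positions always outranks every larger item, because one
    length step outweighs the largest possible score difference.
    """
    return neg_logprob + item_length * MAX_LOGPROB

# even a bad 2-position item pops before a perfect 3-position item:
assert priority(299.0, 2) < priority(0.0, 3)
```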
LCFRS parse items

Use a bit-vector representation of the spanned sentence positions:

◮ LCFRS item (for sentences of length ≤ 64):

    cdef cppclass SmallChartItem:
        uint32_t label
        uint64_t vec

◮ LCFRS item (for sentences of length > 64):

    cdef cppclass FatChartItem:
        uint32_t label
        uint64_t vec[SLOTS]

◮ Combination of items is based on the algorithm in rparse's FastYFComposer.
◮ Items are indexed in the order in which they are found. The index is stored in a B-tree map; items are ordered by label (primary) and vec (secondary).
◮ Probabilities are stored in a vector, indexed by item index.
◮ Incoming edges are stored in a vector[vector[Edge]], indexed by item index.
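The benefit of the bit-vector representation is that combining two items reduces to a few word-level operations. A simplified sketch (only the disjointness check and union, not FastYFComposer's full yield-function test):

```python
def combine(vec1, vec2):
    """Combine the position sets of two items, or return None on overlap.

    Bit p of a vector is set iff the item covers sentence position p;
    two items may only combine if their position sets are disjoint.
    """
    if vec1 & vec2:      # overlapping positions: combination is illegal
        return None
    return vec1 | vec2   # union of the covered positions

# item over positions {0, 1} plus item over positions {3, 4}:
assert combine(0b00011, 0b11000) == 0b11011
# overlapping items cannot combine:
assert combine(0b00011, 0b00010) is None
```

The full composer additionally checks that the combined positions respect the rule's yield function (the args/lengths encoding), which is likewise done with shifts and masks on the same vectors.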
LCFRS agenda

◮ Combines a heap of (item, prob) pairs and a map item → best probability.
◮ While popping: check that the best (item, prob) in the heap satisfies map(item) = prob; otherwise pop the next entry.
◮ On adding (item, prob): check that item ∉ map or map(item) < prob; otherwise discard.
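This heap-plus-map scheme with lazy deletion can be sketched in a few lines (a minimal sketch, not disco-dop's implementation; here lower scores are better, as with negative log-probabilities):

```python
import heapq

class Agenda:
    """Best-first agenda: a heap plus a map from item to its best score.

    Stale heap entries (superseded by a better score) are not removed
    eagerly; they are skipped lazily when popped.
    """

    def __init__(self):
        self.heap = []   # (score, item) pairs, possibly stale
        self.best = {}   # item -> best score seen so far

    def push(self, item, score):
        # discard unless the item is new or strictly improves its score
        if item in self.best and self.best[item] <= score:
            return
        self.best[item] = score
        heapq.heappush(self.heap, (score, item))

    def pop(self):
        while self.heap:
            score, item = heapq.heappop(self.heap)
            if self.best.get(item) == score:  # skip stale entries
                return item, score
        return None

agenda = Agenda()
agenda.push("NP:0-2", 5.0)
agenda.push("NP:0-2", 3.0)  # improvement supersedes the old heap entry
assert agenda.pop() == ("NP:0-2", 3.0)
assert agenda.pop() is None  # the stale (5.0, ...) entry was skipped
```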
References I

Krasimir Angelov and Peter Ljunglöf. "Fast Statistical Parsing with Parallel Multiple Context-Free Grammars". In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: Association for Computational Linguistics, Apr. 2014, pp. 368–376. URL: https://www.aclweb.org/anthology/E14-1039.

Nathan Bodenstab. Efficient Implementation of the CKY Algorithm. Tech. rep. 2009. URL: http://csee.ogi.edu/~bodensta/bodenstab_efficient_cyk.pdf.
References II

Andreas van Cranenburgh, Remko Scha, and Rens Bod. "Data-Oriented Parsing with Discontinuous Constituents and Function Tags". In: Journal of Language Modelling 4.1 (2016), pp. 57–111. DOI: 10.15398/jlm.v4i1.100.

Joshua Goodman. "Efficient Parsing of DOP with PCFG-Reductions". In: Data-Oriented Parsing. Ed. by Rens Bod, Khalil Sima'an, and Remko Scha. Stanford, CA, USA: CSLI Publications, 2003. Chap. 4. ISBN: 1575864355. URL: https://pdfs.semanticscholar.org/2943/16b9b0156eee9cd06c778e06966b77c20e83.pdf.
References III

Dan Klein and Christopher D. Manning. "A* Parsing: Fast Exact Viterbi Parse Selection". In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. Association for Computational Linguistics, 2003, pp. 40–47.

Laura Kallmeyer and Wolfgang Maier. "Data-driven Parsing using Probabilistic Linear Context-Free Rewriting Systems". In: Computational Linguistics 39.1 (2013), pp. 87–119. DOI: 10.1162/COLI_a_00136.

Yue Zhang et al. "Chart Pruning for Fast Lexicalised-Grammar Parsing". In: Coling 2010: Posters. Beijing, China: Coling 2010 Organizing Committee.