Algorithms for NLP, IITP, Fall 2019. Lecture 11: Parsing III

SLIDE 1

Yulia Tsvetkov

Algorithms for NLP

IITP, Fall 2019

Lecture 11: Parsing III

SLIDE 2

Syntactic Parsing

▪ INPUT:

▪ The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market

▪ OUTPUT:

SLIDE 3

Constituent trees

▪ Internal nodes correspond to phrases

▪ S – a sentence
▪ NP (Noun Phrase): My dog, a sandwich, lakes, …
▪ VP (Verb Phrase): ate a sausage, barked, …
▪ PP (Prepositional Phrase): with a friend, in a car, …

▪ Nodes immediately above words are PoS tags (aka preterminals)

▪ PN – pronoun
▪ D – determiner
▪ V – verb
▪ N – noun
▪ P – preposition

SLIDE 4

Context Free Grammar (CFG)

▪ Other grammar formalisms: LFG, HPSG, TAG, CCG…

Grammar (CFG):

ROOT → S
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP

Lexicon:

NN → interest
NNS → raises
VBP → interest
VBZ → raises
…

SLIDE 5

Constraints on the grammar

▪ The basic CKY algorithm supports only rules in Chomsky Normal Form (CNF): preterminal rules of the form C → x and binary (inner) rules of the form C → C1 C2
▪ Any CFG can be converted to an equivalent grammar in CNF

▪ Equivalent means that the two grammars define the same language
▪ However, the (syntactic) trees will look different
▪ This can be addressed by choosing transformations that are easy to reverse
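A minimal sketch of one such reversible transformation, right-binarization of n-ary nodes. It assumes trees are nested tuples `(label, child, ...)` with words as plain strings, and it marks intermediate nodes with a leading `@`; both conventions are illustrative choices, not something the slides prescribe. A full CNF conversion would also have to deal with unary chains, which this sketch leaves alone.

```python
def binarize(tree):
    """Right-binarize a tree given as (label, child, ...) tuples; leaves are strings."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [binarize(c) for c in children]
    # Fold the two rightmost children into an intermediate '@' node until
    # at most two children remain.
    while len(children) > 2:
        right = ("@" + label, children[-2], children[-1])
        children = children[:-2] + [right]
    return (label, *children)


def unbinarize(tree):
    """Undo binarize() by splicing the children of '@' nodes into their parent."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    flat = []
    for child in map(unbinarize, children):
        if not isinstance(child, str) and child[0].startswith("@"):
            flat.extend(child[1:])
        else:
            flat.append(child)
    return (label, *flat)


# Toy instance of the rule VP -> VBP NP PP from the grammar slide.
vp = ("VP", ("VBP", "raises"), ("NP", ("NN", "interest")),
      ("PP", ("IN", "in"), ("NP", ("NNS", "markets"))))
binary_vp = binarize(vp)
```

Because the `@` marker records which nodes were introduced, `unbinarize(binarize(t))` gives back the original tree, which is what makes evaluation against the original treebank trees possible.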

SLIDE 6

Parsing with CKY

Preterminal rules Inner rules

SLIDE 7

Preterminal rules Inner rules

Chart (aka parsing triangle)

SLIDE 24

Preterminal rules Inner rules

mid=1

SLIDE 25

Preterminal rules Inner rules

mid=2

SLIDE 26

Preterminal rules Inner rules

The sentence is ambiguous under the grammar (the grammar overgenerates)

SLIDE 27

PCFGs

[Figure: the grammar and lexicon rules, each annotated with its probability]

SLIDE 28

CKY with PCFGs

▪ Chart is represented by a 3d array of floats chart[min][max][label]

▪ It stores probabilities for the most probable subtree with a given signature

▪ chart[0][n][S] will store the probability of the most probable full parse tree

SLIDE 29

Intuition

For every C, choose C1, C2 and mid such that P(T1) × P(T2) × P(C → C1 C2) is maximal, where T1 and T2 are the left and right subtrees.

SLIDE 30

Implementation: preterminal rules

SLIDE 31

Implementation: binary rules
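A minimal sketch of the two steps (preterminal rules, then binary rules), assuming a CNF grammar. Instead of a dense 3d array `chart[min][max][label]` it uses a dictionary keyed by `(min, max)` whose values map labels to probabilities. The toy lexicon and rule probabilities are made up for illustration.

```python
from collections import defaultdict


def viterbi_cky(words, lexicon, binary_rules):
    """Return a chart where chart[(lo, hi)][label] is the best subtree probability."""
    n = len(words)
    chart = defaultdict(dict)
    # Preterminal rules: fill every span of width 1.
    for i, word in enumerate(words):
        for (tag, w), p in lexicon.items():
            if w == word:
                chart[(i, i + 1)][tag] = p
    # Binary rules: combine two adjacent sub-spans at every split point mid.
    for width in range(2, n + 1):
        for lo in range(0, n - width + 1):
            hi = lo + width
            for mid in range(lo + 1, hi):
                for (parent, left, right), p in binary_rules.items():
                    if left in chart[(lo, mid)] and right in chart[(mid, hi)]:
                        score = p * chart[(lo, mid)][left] * chart[(mid, hi)][right]
                        # Keep only the most probable subtree per signature.
                        if score > chart[(lo, hi)].get(parent, 0.0):
                            chart[(lo, hi)][parent] = score
    return chart


# Hypothetical toy PCFG in CNF.
lexicon = {("N", "dogs"): 0.5, ("N", "cats"): 0.5, ("V", "chase"): 1.0}
binary_rules = {("S", "N", "VP"): 1.0, ("VP", "V", "N"): 1.0}
chart = viterbi_cky(["dogs", "chase", "cats"], lexicon, binary_rules)
```

Here `chart[(0, 3)]["S"]` plays the role of `chart[0][n][S]`, the probability of the most probable full parse. Looping over all rules for every split point is the simplest formulation; indexing rules by their child labels is a common speed-up.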

SLIDE 32

Recovery of the tree

▪ For each signature we store backpointers to the elements from which it was built

▪ start recovering from [0, n, S]

▪ What backpointers do we store?

SLIDE 33

Recovery of the tree

▪ Backpointers stored for each signature:

▪ the rule that was applied
▪ for binary rules, the midpoint
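A minimal sketch of tree recovery from backpointers. It assumes a table `back` keyed by the signature `(min, max, label)`: for a preterminal rule the entry is the word itself, and for a binary rule it is the pair of child labels plus the midpoint. The table below is written by hand for illustration; in a parser it would be filled in while the chart is built.

```python
def build_tree(back, lo, hi, label):
    """Follow backpointers from signature (lo, hi, label) down to the words."""
    entry = back[(lo, hi, label)]
    if isinstance(entry, str):        # preterminal rule: the entry is the word
        return (label, entry)
    (left, right), mid = entry        # binary rule: child labels and midpoint
    return (label,
            build_tree(back, lo, mid, left),
            build_tree(back, mid, hi, right))


# Hand-written backpointers for "dogs chase cats" (toy example).
back = {
    (0, 1, "N"): "dogs",
    (1, 2, "V"): "chase",
    (2, 3, "N"): "cats",
    (1, 3, "VP"): (("V", "N"), 2),
    (0, 3, "S"): (("N", "VP"), 1),
}
tree = build_tree(back, 0, 3, "S")
```

Recovery starts from `[0, n, S]` and touches each signature of the best tree exactly once.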

SLIDE 34

Speeding up the algorithm

▪ Basic pruning (roughly):

▪ For every span (i, j), store only those labels whose probability is at most N times smaller than the probability of the most probable label for that span
▪ Do not check all rules; check only rules whose subtree labels have non-zero probability

▪ Coarse-to-fine pruning

▪ Parse with a smaller (simpler) grammar, precompute (posterior) probabilities for each span, and then consider only the spans with non-negligible probability under that simpler grammar
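A minimal sketch of the basic per-span pruning rule, assuming a chart cell is a dictionary from labels to probabilities. The function name `prune_cell` and the parameter name `ratio` (the N of the slide) are hypothetical.

```python
def prune_cell(cell, ratio):
    """Keep labels whose probability is at most `ratio` times smaller than the best."""
    if not cell:
        return {}
    best = max(cell.values())
    return {label: p for label, p in cell.items() if p * ratio >= best}


# Toy cell: with ratio 10, VP is more than 10 times less probable than NP.
cell = {"NP": 0.2, "S": 0.05, "VP": 0.001}
kept = prune_cell(cell, 10)
```

Coarse-to-fine pruning applies the same idea across grammars: the filter is computed from the simpler grammar's span probabilities and then restricts which cells the full grammar is allowed to fill.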

SLIDE 35

Parsing evaluation

▪ Intrinsic evaluation:

▪ Automatic: evaluate against annotation provided by human experts (gold standard) according to some predefined measure
▪ Manual: … according to human judgment

▪ Extrinsic evaluation: score syntactic representation by comparing how well a system using this representation performs on some task

▪ E.g., use syntactic representation as input for a semantic analyzer and compare results of the analyzer using syntax predicted by different parsers.

SLIDE 36

Standard evaluation setting in parsing

▪ Automatic intrinsic evaluation is used: parsers are evaluated against a gold standard provided by linguists

▪ There is a standard split of the data into three parts:

▪ training set: used for estimating model parameters
▪ development set: used for tuning the model (initial experiments)
▪ test set: used for final experiments, to compare against previous work

SLIDE 37

Automatic evaluation of constituent parsers

▪ Exact match: percentage of trees predicted correctly
▪ Bracket score: scores how well individual phrases (and their boundaries) are identified. This is the most standard measure; we will focus on it

SLIDE 38

Bracket scores

▪ The most standard score is the bracket score
▪ It regards a tree as a collection of brackets; these are the same subtree signatures used in CKY
▪ The set of brackets predicted by a parser is compared against the set of brackets in the tree annotated by a linguist
▪ Precision, recall and F1 are used as scores
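A minimal sketch of labeled bracket precision, recall and F1. It assumes trees are nested tuples `(label, child, ...)` with words as strings, excludes preterminal (PoS) nodes from the bracket set, and uses sets, so repeated identical brackets count once. These are simplifying assumptions; standard evaluation tools have additional conventions.

```python
def brackets(tree, start=0):
    """Return (set of (label, start, end) brackets, end index) for a tuple tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return set(), start + 1       # preterminal: covers one word, not counted
    spans, pos = set(), start
    for child in children:
        child_spans, pos = brackets(child, pos)
        spans |= child_spans
    spans.add((label, start, pos))
    return spans, pos


def bracket_f1(gold, predicted):
    """Return (precision, recall, F1) over labeled brackets."""
    g, _ = brackets(gold)
    p, _ = brackets(predicted)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ("S", ("NP", ("D", "the"), ("N", "dog")),
        ("VP", ("V", "saw"), ("NP", ("D", "a"), ("N", "cat"))))
pred = ("S", ("NP", ("NP", ("D", "the"), ("N", "dog")), ("V", "saw")),
        ("NP", ("D", "a"), ("N", "cat")))
precision, recall, f1 = bracket_f1(gold, pred)
```

In this toy pair, three of the four predicted brackets also occur in the gold tree, and three of the four gold brackets are recovered.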

SLIDE 39

Preview: F1 bracket score

SLIDE 40

Constituent (phrase-structure) representation

SLIDE 41

Dependency representation

SLIDE 42

Dependency representation

▪ A dependency structure can be defined as a directed graph G, consisting of

▪ a set V of nodes (vertices): words, punctuation, morphemes
▪ a set A of arcs (directed edges)
▪ a linear precedence order < on V (word order)

▪ Labeled graphs

▪ nodes in V are labeled with word forms (and annotation)
▪ arcs in A are labeled with dependency types
▪ L is the set of permissible arc labels
▪ every arc in A is a triple (i, j, k), representing a dependency from node i to node j with label l_k ∈ L

SLIDE 43

Dependency vs Constituency

▪ Dependency structures explicitly represent

▪ head-dependent relations (directed arcs)
▪ functional categories (arc labels)
▪ possibly some structural categories (parts of speech)

▪ Phrase (aka constituent) structures explicitly represent

▪ phrases (nonterminal nodes)
▪ structural categories (nonterminal labels)

SLIDE 44

Dependency vs Constituency trees

SLIDE 45

Parsing Languages with Flexible Word Order

I prefer the morning flight through Denver
Я предпочитаю утренний перелет через Денвер

SLIDE 46

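Languages with free word order

I prefer the morning flight through Denver

Я предпочитаю утренний перелет через Денвер (I prefer morning flight through Denver)
Я предпочитаю через Денвер утренний перелет (I prefer through Denver morning flight)
Утренний перелет я предпочитаю через Денвер (Morning flight I prefer through Denver)
Перелет утренний я предпочитаю через Денвер (Flight morning I prefer through Denver)
Через Денвер я предпочитаю утренний перелет (Through Denver I prefer morning flight)
Я через Денвер предпочитаю утренний перелет (I through Denver prefer morning flight)
...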

SLIDE 47

Dependency relations

SLIDE 48

Types of relationships

▪ The clausal relations NSUBJ and DOBJ identify the arguments: the subject and direct object of the predicate "cancel"
▪ The NMOD, DET, and CASE relations denote modifiers of the nouns "flights" and "Houston"

SLIDE 49

Grammatical functions

SLIDE 50

Dependency Constraints

▪ Syntactic structure is complete (connectedness)

▪ connectedness can be enforced by adding a special root node

▪ Syntactic structure is hierarchical (acyclicity)

▪ there is a unique path from the root to each vertex

▪ Every word has at most one syntactic head (single-head constraint)

▪ except the root, which has no incoming arcs

This makes the dependencies a tree
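A minimal sketch of checking these constraints, assuming the analysis is stored as a list `heads` where `heads[j]` is the head of word j (words numbered from 1) and 0 is the special root node. Storing exactly one head per word enforces the single-head constraint by construction; the function checks that every word reaches the root without running into a cycle. The head indices below are one plausible analysis, given for illustration.

```python
def is_dependency_tree(heads):
    """heads[j] is the head of word j (1-based); 0 denotes the artificial root."""
    n = len(heads) - 1
    for j in range(1, n + 1):
        seen = set()
        node = j
        while node != 0:
            if node in seen or not (0 <= heads[node] <= n):
                return False          # cycle, or head index out of range
            seen.add(node)
            node = heads[node]
    return True


# "I prefer the morning flight through Denver"; heads[0] is an unused placeholder.
heads = [None, 2, 0, 5, 5, 2, 7, 5]
cyclic = [None, 2, 1, 0]              # words 1 and 2 point at each other
```

Walking up from every word is quadratic in the worst case, which is fine for sentence-length inputs; marking visited nodes across walks would make it linear.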

SLIDE 51

Projectivity

▪ Projective parse

▪ arcs don’t cross each other
▪ mostly true for English

▪ Non-projective structures are needed to account for

▪ long-distance dependencies
▪ flexible word order

SLIDE 52

Projectivity

▪ Dependency grammars do not normally assume that all dependency trees are projective, because some linguistic phenomena can only be captured with non-projective trees
▪ But many parsers assume that the output trees are projective
▪ Reasons

▪ conversion from constituency to dependency
▪ the most widely used families of parsing algorithms impose projectivity

SLIDE 53

Detecting Projectivity/Non-Projectivity

▪ The idea is to use the inorder traversal of the tree: <left-child, root, right-child>

▪ This is well defined for binary trees. We need to extend it to n-ary trees.

▪ If we have a projective tree, the inorder traversal will give us the original linear order.
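A minimal sketch of the inorder-traversal test extended to n-ary trees: for each node, visit the subtrees of its left dependents, then the node itself, then the subtrees of its right dependents. It assumes `heads[j]` is the head of word j (words numbered from 1, root = 0) and that the input is a well-formed tree. Both example analyses are illustrative assumptions.

```python
def is_projective(heads):
    """Return True if the inorder traversal reproduces the word order 1..n."""
    n = len(heads) - 1
    children = {i: [] for i in range(n + 1)}
    for j in range(1, n + 1):
        children[heads[j]].append(j)   # dependents end up in increasing order

    order = []

    def visit(node):
        for c in children[node]:
            if c < node:               # left dependents first
                visit(c)
        if node != 0:                  # the artificial root is not a word
            order.append(node)
        for c in children[node]:
            if c > node:               # then right dependents
                visit(c)

    visit(0)
    return order == list(range(1, n + 1))


# "I prefer the morning flight through Denver"
projective = [None, 2, 0, 5, 5, 2, 7, 5]
# "A hearing is scheduled on the issue today", with "on" attached to "hearing"
non_projective = [None, 2, 3, 0, 3, 2, 7, 5, 4]
```

In the second analysis the arc from "hearing" to "on" spans "is" and "scheduled", whose heads lie outside that span, so the traversal emits the words out of order.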

SLIDE 54

Non-Projective Statistics

SLIDE 55

Dependency Treebanks

▪ The major English dependency treebanks are converted from the WSJ sections of the PTB (Marcus et al., 1993)
▪ The OntoNotes project (Hovy et al. 2006, Weischedel et al. 2011) adds conversational telephone speech, weblogs, usenet newsgroups, broadcast, and talk shows in English, Chinese and Arabic
▪ Annotated dependency treebanks have been created for morphologically rich languages such as Czech, Hindi and Finnish, e.g. the Prague Dependency Treebank (Bejcek et al., 2013)
▪ Universal Dependencies: http://universaldependencies.org/

▪ 122 treebanks, 71 languages

SLIDE 56

Conversion from constituency to dependency

▪ Xia and Palmer (2001)

▪ mark the head child of each node in a phrase structure, using the appropriate head rules
▪ make the head of each non-head child depend on the head of the head child
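A minimal sketch of these two steps. It assumes trees are nested tuples `(label, child, ...)` with words as strings, word positions counted from 0, and a tiny hypothetical head-rule table `HEAD_RULES`; real head rules are considerably more detailed.

```python
# Hypothetical toy head rules: for each phrase label, which child label is the head.
HEAD_RULES = {"S": ["VP"], "VP": ["V"], "NP": ["N"], "PP": ["P"]}


def convert(tree, start, arcs):
    """Return (position of the lexical head, end position); append (head, dependent) arcs."""
    label, *children = tree
    if isinstance(children[0], str):   # preterminal: the word is its own head
        return start, start + 1
    heads, pos = [], start
    for child in children:
        h, pos = convert(child, pos, arcs)
        heads.append((child[0], h))
    # Step 1: mark the head child using the head rules (default: the last child).
    preferred = HEAD_RULES.get(label, [])
    head_pos = next((k for k, (lab, _) in enumerate(heads) if lab in preferred),
                    len(heads) - 1)
    head_word = heads[head_pos][1]
    # Step 2: the head of every non-head child depends on the head of the head child.
    for k, (_, h) in enumerate(heads):
        if k != head_pos:
            arcs.append((head_word, h))
    return head_word, pos


tree = ("S", ("NP", ("D", "the"), ("N", "dog")),
        ("VP", ("V", "chased"), ("NP", ("D", "a"), ("N", "cat"))))
arcs = []
root_word, length = convert(tree, 0, arcs)
```

The arcs produced this way never cross, which is one reason treebanks converted from constituency annotation are projective.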

SLIDE 57

Parsing problem

The parsing problem for a dependency parser is to find the optimal dependency tree y given an input sentence x

This amounts to assigning a syntactic head i and a label l to every node j corresponding to a word xj, in such a way that the resulting graph is a tree rooted at node 0

SLIDE 58

Parsing problem

▪ This is equivalent to finding a spanning tree in the complete graph containing all possible arcs

SLIDE 59

Parsing algorithms

▪ Transition based

▪ greedy choice of local transitions guided by a good classifier
▪ deterministic
▪ MaltParser (Nivre et al. 2008)

▪ Graph based

▪ Maximum spanning tree (MST) for a sentence
▪ McDonald et al.’s (2005) MSTParser
▪ Martins et al.’s (2009) Turbo Parser

SLIDE 60

Transition Based Parsing

▪ greedy discriminative dependency parser
▪ motivated by a stack-based approach called shift-reduce parsing, originally developed for analyzing programming languages (Aho & Ullman, 1972)
▪ Nivre 2003

SLIDE 61

Configuration

SLIDE 62

Configuration

Buffer: unprocessed words
Stack: partially processed words
Oracle: a classifier

SLIDE 65

Operations

Buffer: unprocessed words
Stack: partially processed words
Oracle: a classifier

At each step choose:
▪ Shift
▪ LeftArc or Reduce left
▪ RightArc or Reduce right

SLIDE 66

Shift-Reduce Parsing

Configuration:
▪ Stack, Buffer, Oracle, Set of dependency relations

Operations chosen by a classifier at each step:
▪ Shift

▪ remove w1 from the buffer, add it to the top of the stack as s1

▪ LeftArc or Reduce left

▪ assert a head-dependent relation between s1 (head) and s2 (dependent)
▪ remove s2 from the stack

▪ RightArc or Reduce right

▪ assert a head-dependent relation between s2 (head) and s1 (dependent)
▪ remove s1 from the stack
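A minimal sketch of the three operations applied to a given transition sequence. It assumes words are identified by their positions 1..n, the stack initially holds the root node 0, and arcs are unlabeled `(head, dependent)` pairs; the function name `run_transitions` is hypothetical, and in a real parser the next transition would come from the classifier.

```python
def run_transitions(n_words, transitions):
    """Apply Shift / LeftArc / RightArc operations; return (arcs, stack, buffer)."""
    stack = [0]                              # 0 is the artificial root
    buffer = list(range(1, n_words + 1))     # unprocessed word positions
    arcs = set()                             # (head, dependent) pairs
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFTARC":                 # s1 is the head of s2; remove s2
            s1, s2 = stack[-1], stack[-2]
            arcs.add((s1, s2))
            del stack[-2]
        elif t == "RIGHTARC":                # s2 is the head of s1; remove s1
            s1, s2 = stack[-1], stack[-2]
            arcs.add((s2, s1))
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs, stack, buffer


# "I prefer flights": "prefer" heads both "I" and "flights"; the root heads "prefer".
sequence = ["SHIFT", "SHIFT", "LEFTARC", "SHIFT", "RIGHTARC", "RIGHTARC"]
arcs, stack, buffer = run_transitions(3, sequence)
```

Every word is shifted once and removed from the stack once, so a sentence of n words is parsed with 2n transitions.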

SLIDE 80

Shift-Reduce Parsing

Complexity?

Oracle decisions can correspond to unlabeled or labeled arcs

SLIDE 81

Training an Oracle

▪ Oracle is a supervised classifier that learns a function from the configuration to the next operation
▪ How to extract the training set?

SLIDE 84

Training an Oracle

▪ How to extract the training set? Simulate the parser on each gold tree and, at every configuration, record:

▪ if LeftArc produces a correct head-dependent relation → LeftArc
▪ else if RightArc produces a correct head-dependent relation

▪ and all dependents of s1 have already been processed → RightArc

▪ else → Shift
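A minimal sketch of extracting training examples with these rules. It assumes gold trees are given as a list `heads` (`heads[j]` is the gold head of word j, words numbered from 1, root = 0) and that a configuration is represented simply as the pair (stack, buffer); a real system would turn each configuration into features. The function name is hypothetical.

```python
def oracle_transitions(heads):
    """Return a list of ((stack, buffer), action) training pairs for one gold tree."""
    n = len(heads) - 1
    stack, buffer = [0], list(range(1, n + 1))
    attached = set()                  # words whose head has already been assigned
    examples = []
    while buffer or len(stack) > 1:
        s1 = stack[-1]
        s2 = stack[-2] if len(stack) > 1 else None
        # s1 may only be removed once all of its gold dependents are attached.
        s1_done = all(d in attached for d in range(1, n + 1) if heads[d] == s1)
        if s2 is not None and s2 != 0 and heads[s2] == s1:
            action = "LEFTARC"
        elif s2 is not None and heads[s1] == s2 and s1_done:
            action = "RIGHTARC"
        elif buffer:
            action = "SHIFT"
        else:
            raise ValueError("gold tree not reachable (is it non-projective?)")
        examples.append(((tuple(stack), tuple(buffer)), action))
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFTARC":
            attached.add(s2)
            del stack[-2]
        else:
            attached.add(s1)
            stack.pop()
    return examples


# "I prefer flights" with gold heads: I <- prefer, prefer <- root, flights <- prefer.
examples = oracle_transitions([None, 2, 0, 2])
actions = [action for _, action in examples]
```

The check on the dependents of s1 is what delays RightArc: without it, "prefer" would be attached to the root and removed before "flights" could be attached to it.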

▪ What features to use?

SLIDE 85

Features

▪ POS, word-forms, lemmas on the stack/buffer
▪ morphological features for some languages
▪ previous relations
▪ conjunction features (e.g. Zhang&Clark’08; Huang&Sagae’10; Zhang&Nivre’11)

SLIDE 86

Learning

▪ Before 2014: SVMs
▪ After 2014: Neural Nets

SLIDE 87

Chen & Manning 2014

Slides by Danqi Chen & Chris Manning

SLIDE 89

Chen & Manning 2014

▪ Features

▪ s1, s2, s3, b1, b2, b3
▪ leftmost/rightmost children of s1 and s2
▪ leftmost/rightmost grandchildren of s1 and s2
▪ POS tags for the above
▪ arc labels for children/grandchildren

SLIDE 90

Evaluation of Dependency Parsers

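▪ LAS (labeled attachment score): percentage of words assigned both the correct head and the correct dependency label
▪ UAS (unlabeled attachment score): percentage of words assigned the correct head, ignoring the label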
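A minimal sketch of both scores, assuming each analysis is a list with one `(head, label)` pair per word. The example labels and head indices are made up for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equally long lists of (head, label) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty analyses of the same sentence")
    # UAS counts correct heads; LAS additionally requires the correct label.
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / len(gold)
    las = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    return uas, las


gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In this toy pair the third word has the right head but the wrong label, so it counts for UAS only, and the fourth word has the wrong head, so it counts for neither. LAS can never exceed UAS.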

SLIDE 91

Chen & Manning 2014

SLIDE 92

Follow-up

SLIDE 93

Stack LSTMs (Dyer et al. 2015)

SLIDE 94

Arc-Eager

▪ LEFTARC: Assert a head-dependent relation between b1 (head) and s1 (dependent); pop the stack.
▪ RIGHTARC: Assert a head-dependent relation between s1 (head) and b1 (dependent); shift b1 to be s1.
▪ SHIFT: Remove b1 and push it to be s1.
▪ REDUCE: Pop the stack.
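A minimal sketch of the four arc-eager operations applied to a given transition sequence. It assumes words are identified by positions 1..n, the stack starts with the root node 0, and arcs are unlabeled `(head, dependent)` pairs. Preconditions (for example, REDUCE only when s1 already has a head) are not checked here.

```python
def run_arc_eager(n_words, transitions):
    """Apply arc-eager operations; return (arcs, stack, buffer)."""
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), set()
    for t in transitions:
        if t == "SHIFT":                 # move b1 onto the stack
            stack.append(buffer.pop(0))
        elif t == "LEFTARC":             # b1 is the head of s1; pop the stack
            arcs.add((buffer[0], stack.pop()))
        elif t == "RIGHTARC":            # s1 is the head of b1; b1 becomes s1
            arcs.add((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif t == "REDUCE":              # s1 is finished; pop the stack
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs, stack, buffer


# "I prefer flights" again: right dependents are attached as soon as they appear.
arcs, stack, buffer = run_arc_eager(3, ["SHIFT", "LEFTARC", "RIGHTARC", "RIGHTARC"])
```

Compared with the arc-standard sketch style of parsing, "flights" is attached to "prefer" without waiting for the dependents of "flights" to be collected first, which is why a separate REDUCE operation is needed to remove finished words later.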

SLIDE 95

Arc-Eager

SLIDE 96

Beam Search

SLIDE 97

Parsing algorithms

▪ Transition based

▪ greedy choice of local transitions guided by a good classifier
▪ deterministic
▪ MaltParser (Nivre et al. 2008), Stack LSTM (Dyer et al. 2015)

▪ Graph based

▪ Maximum spanning tree (MST) for a sentence
▪ non-projective
▪ globally optimized
▪ McDonald et al.’s (2005) MSTParser
▪ Martins et al.’s (2009) Turbo Parser

SLIDE 98

Graph-Based Parsing Algorithms

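▪ Start with a fully-connected directed graph
▪ Find a maximum spanning tree (the highest-scoring tree rooted at the root node)

▪ Chu and Liu (1965) and Edmonds (1967) algorithm

▪ Edge-factored approaches: the score of a tree is the sum of the scores of its edges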

SLIDE 99

Chu-Liu Edmonds algorithm

▪ Select best incoming edge for each node
▪ Subtract its score from all incoming edges
▪ Contract nodes if there are cycles
▪ Stopping condition
▪ Recursively compute MST
▪ Expand contracted nodes
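A minimal recursive sketch of these steps for an edge-factored model, maximizing the total score. It assumes integer node ids with root 0, a dictionary `scores[(head, dependent)]` without self-loops, and at least one incoming arc for every non-root node. As a simplification, the score subtraction is applied only to edges entering a contracted cycle, which selects the same tree. The example scores are invented.

```python
def find_cycle(head):
    """Return the nodes of one cycle in a {dependent: head} map, or None."""
    for start in head:
        path, node = [], start
        while node in head and node not in path:
            path.append(node)
            node = head[node]
        if node in path:
            return path[path.index(node):]
    return None


def chu_liu_edmonds(scores, root=0):
    """scores[(h, d)] is the score of arc h -> d; returns {dependent: head}."""
    nodes = {d for _, d in scores if d != root}
    # Select the best incoming edge for each node.
    best = {d: max((h for h, dep in scores if dep == d),
                   key=lambda h: scores[(h, d)]) for d in nodes}
    cycle = find_cycle(best)
    if cycle is None:                  # stopping condition: already a tree
        return best
    # Contract the cycle into a fresh node c, re-scoring edges that enter it.
    in_cycle = set(cycle)
    c = max(max(h, d) for h, d in scores) + 1
    new_scores, origin = {}, {}
    for (h, d), s in scores.items():
        if h in in_cycle and d in in_cycle:
            continue
        if d in in_cycle:              # subtract the cycle edge it would replace
            key, value = (h, c), s - scores[(best[d], d)]
        elif h in in_cycle:
            key, value = (c, d), s
        else:
            key, value = (h, d), s
        if key not in new_scores or value > new_scores[key]:
            new_scores[key], origin[key] = value, (h, d)
    # Recursively solve the contracted graph, then expand the contracted node.
    result = {}
    for d, h in chu_liu_edmonds(new_scores, root).items():
        orig_head, orig_dep = origin[(h, d)]
        result[orig_dep] = orig_head
    for d in cycle:                    # keep cycle edges except the broken one
        result.setdefault(d, best[d])
    return result


# Toy graph: the greedy choices for nodes 2 and 3 form a cycle.
scores = {(0, 1): 12, (0, 2): 4, (0, 3): 4, (1, 2): 5, (1, 3): 7,
          (2, 1): 6, (2, 3): 8, (3, 1): 5, (3, 2): 7}
tree = chu_liu_edmonds(scores)
```

In the toy graph, nodes 2 and 3 first pick each other; after contraction, the edge 1 → 3 turns out to be the cheapest way to break the cycle, so node 2 keeps its cycle edge from node 3.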

SLIDE 105

Scores

▪ Wordforms, lemmas, and parts of speech of the headword and its dependent.
▪ Corresponding features derived from the contexts before, after and between the words.
▪ Word embeddings.
▪ The dependency relation itself.
▪ The direction of the relation (to the right or left).
▪ The distance from the head to the dependent.

SLIDE 106

Summary

▪ Transition-based

▪ + Fast
▪ + Rich features of context
▪ - Greedy decoding

▪ Graph-based

▪ + Exact or close to exact decoding
▪ - Weaker features

Well-engineered versions of the approaches achieve comparable accuracy (on English), but make different errors

→ combining the strategies results in a substantial boost in performance

SLIDE 107

HW 2: Parsing

11711, Fall 2019 Due Oct 17th

SLIDE 108

Assignment 2 is released!

SLIDE 109

Overview

Goal:

1) build a generative parser
Binarization + Parent Annotation + Markovization → learn PCFG
Implement CKY to compute Viterbi trees

2) implement coarse-to-fine pruning (extra credit)

Evaluation: F1 score

Data: Penn Treebank, 2400 sentences with parse trees
training 2k / valid 100 / test 100

SLIDE 110

Main Implementation

1. Main entry point: PCFGParserTester (+ baseline: BaselineParser)
2. Two classes you need to implement
GenerativeParserFactory
CoarseToFineParserFactory (optional)
3. Methods you need to implement
getParser(List<Tree<String>> trainTrees)
getBestParse(List<String> sentence)

SLIDE 111

Example from BaselineParser

1. getParser(List<Tree<String>> trainTrees)
2. getBestParse(List<String> sentence)

SLIDE 112

Submission

Submit to Canvas (Assignment 2)

1. Submit the 'project.tgz' that contains the following files:
submit.jar (renamed from assign_parsing_submit.jar)
writeup.pdf
2. Before you submit, make sure you pass the sanity check

SLIDE 113

Submission - writeup.pdf

1. The write-up should be 2-3 pages in length (+1 is acceptable, but we won't read more than 4 pages), and should be written in the voice and style of a typical computer science conference paper.

2. In the write-up:

(1) describe your implementation choices (annotation, markovization, ...)
(2) report performance using appropriate graphs and tables
(3) include some of your own investigation/error analysis of the results

SLIDE 114

Grading

No hard requirements this time, but the following results are considered strong:

80 F1 on sentences of length up to 40 with the generative parser
Coarse-to-fine speeds up parsing by a factor of 1.5-2

Reference:

training: takes 10-15 secs
max_length 15: decode 17 secs, 86.01 F1
max_length 40: decode 664 secs, 79.92 F1

SLIDE 115

Recitation

There will be a recitation this Friday (Oct 4, 1:30 - 2:20 pm, DH 2302)

It will cover:
1) CKY parsing + coarse-to-fine pruning recap
2) Implementation tips