slide-1
SLIDE 1

Natural Language Processing (CSE 517): Dependency Structure

Noah Smith

© 2016 University of Washington, nasmith@cs.washington.edu

February 24, 2016

1 / 45

slide-2
SLIDE 2

◮ Why might you want to use a generative classifier, such as Naive Bayes, as opposed to a discriminative classifier, and vice versa?
◮ How can one deal with out-of-vocabulary words at test time when one is applying an HMM for POS tagging or a PCFG for parsing?
◮ What is marginal inference, and how can it be carried out on a factor graph?
◮ What are the advantages and disadvantages of using a context-free grammar in Chomsky normal form?

2 / 45

slide-3
SLIDE 3

Starting Point: Phrase Structure

(S (NP (DT The) (NN luxury) (NN auto) (NN maker)) (NP (JJ last) (NN year)) (VP (VBD sold) (NP (CD 1,214) (NN cars)) (PP (IN in) (NP (DT the) (NNP U.S.)))))

3 / 45

slide-4
SLIDE 4

Parent Annotation

(Johnson, 1998)

(S^ROOT (NP^S (DT^NP The) (NN^NP luxury) (NN^NP auto) (NN^NP maker)) (NP^S (JJ^NP last) (NN^NP year)) (VP^S (VBD^VP sold) (NP^VP (CD^NP 1,214) (NN^NP cars)) (PP^VP (IN^PP in) (NP^PP (DT^NP the) (NNP^NP U.S.)))))

Increases the “vertical” Markov order: p(children | parent, grandparent)

4 / 45
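
To make the transformation concrete, here is a minimal sketch of parent annotation, assuming trees are encoded as nested lists of the form [label, child, ...]; the encoding and the helper name are illustrative, not from the slides.

```python
# Minimal sketch of parent annotation (in the spirit of Johnson, 1998).
# Trees are nested lists [label, child, ...]; strings are words.  This
# encoding is an assumption made for illustration.
def parent_annotate(tree, parent="ROOT"):
    if isinstance(tree, str):            # a word: leave it unchanged
        return tree
    label, children = tree[0], tree[1:]
    new_label = f"{label}^{parent}"      # e.g. an NP under S becomes NP^S
    return [new_label] + [parent_annotate(child, label) for child in children]

t = ["S", ["NP", ["DT", "The"], ["NN", "maker"]],
          ["VP", ["VBD", "sold"], ["NP", ["CD", "1,214"], ["NN", "cars"]]]]
print(parent_annotate(t))
```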

slide-5
SLIDE 5

Headedness

[Same phrase-structure tree as on the previous slide; the original figure marks the head child of each node.]

Suggests “horizontal” markovization:

  p(children | parent) = p(head | parent) · ∏_i p(ith sibling | head, parent)

5 / 45
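
As a minimal numerical illustration of this factorization (the probability tables and values below are made up for the example, not estimates from a treebank):

```python
# Toy sketch of head-outward ("horizontal") markovization: the probability of
# a rule's children is factored into the head child given the parent, times
# each sibling given the head and parent.  The tables hold invented numbers.
def rule_prob(parent, head, siblings, p_head, p_sib):
    prob = p_head[(head, parent)]
    for sib in siblings:
        prob *= p_sib[(sib, head, parent)]
    return prob

p_head = {("VBD", "VP"): 0.6}
p_sib = {("NP", "VBD", "VP"): 0.5, ("PP", "VBD", "VP"): 0.3}
print(rule_prob("VP", "VBD", ["NP", "PP"], p_head, p_sib))  # 0.6 * 0.5 * 0.3 = 0.09
```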

slide-6
SLIDE 6

Lexicalization

(S_sold (NP_maker (DT_The The) (NN_luxury luxury) (NN_auto auto) (NN_maker maker)) (NP_year (JJ_last last) (NN_year year)) (VP_sold (VBD_sold sold) (NP_cars (CD_1,214 1,214) (NN_cars cars)) (PP_in (IN_in in) (NP_U.S. (DT_the the) (NNP_U.S. U.S.)))))

Each node shares a lexical head with its head child.

6 / 45

slide-7
SLIDE 7

Transformations on Trees

Starting around 1998, many different ideas emerged, both linguistic and statistical, about how to transform treebank trees. All of these transformations make the grammar larger, and therefore make rule frequencies sparser, so a lot of research went into smoothing the rule probabilities. Examples include parent annotation, headedness, markovization, and lexicalization, as well as category refinement by linguistic rules (Klein and Manning, 2003).

◮ These are reflected in some versions of the popular Stanford and Berkeley parsers.

7 / 45

slide-8
SLIDE 8

Tree Decorations

(Klein and Manning, 2003)

◮ Mark nodes with only 1 child as UNARY
◮ Mark DTs (determiners) and RBs (adverbs) when they are only children
◮ Annotate POS tags with their parents
◮ Split IN (prepositions; 6 ways), AUX, CC, %
◮ NPs: temporal, possessive, base
◮ VPs annotated with head tag (finite vs. others)
◮ DOMINATES-V
◮ RIGHT-RECURSIVE NP

8 / 45

slide-9
SLIDE 9

Machine Learning and Parsing

9 / 45

slide-10
SLIDE 10

Machine Learning and Parsing

◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
◮ K-best parsing: Huang and Chiang (2005)

10 / 45

slide-11
SLIDE 11

Machine Learning and Parsing

◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
◮ K-best parsing: Huang and Chiang (2005)
◮ Define rule-local features on trees (and any part of the input sentence); minimize hinge or log loss.
◮ These exploit dynamic programming algorithms for training (CKY for arbitrary scores, and the sum-product version).

11 / 45

slide-12
SLIDE 12

Structured Perceptron

Collins (2002)

Perceptron algorithm for parsing:

◮ For t ∈ {1, . . . , T}:
  ◮ Pick i_t uniformly at random from {1, . . . , n}.
  ◮ t̂_{i_t} ← argmax_{t ∈ T_{x_{i_t}}} w · Φ(x_{i_t}, t)
  ◮ w ← w − α (Φ(x_{i_t}, t̂_{i_t}) − Φ(x_{i_t}, t_{i_t}))

This can be viewed as stochastic subgradient descent on the structured hinge loss:

  ∑_{i=1}^{n} [ max_{t ∈ T_{x_i}} w · Φ(x_i, t) (“fear”) − w · Φ(x_i, t_i) (“hope”) ]

12 / 45
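
A minimal Python sketch of this update loop, assuming two hypothetical helpers that the slides do not define: decode(x, w), which returns the highest-scoring tree under weights w, and features(x, t), which returns the feature map Φ(x, t) as a dict of counts.

```python
import random
from collections import defaultdict

# Sketch of the structured perceptron for parsing.  `decode` and `features`
# are assumed helpers; `data` is a list of (sentence, gold_tree) pairs.
def structured_perceptron(data, decode, features, T=5000, alpha=1.0):
    w = defaultdict(float)
    for _ in range(T):
        x, t_gold = random.choice(data)        # pick i_t uniformly at random
        t_hat = decode(x, w)                   # argmax_t  w . Phi(x, t)
        if t_hat != t_gold:                    # w <- w - alpha (Phi(x, t_hat) - Phi(x, t_gold))
            for f, v in features(x, t_gold).items():
                w[f] += alpha * v
            for f, v in features(x, t_hat).items():
                w[f] -= alpha * v
    return w
```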

slide-13
SLIDE 13

Beyond Structured Perceptron (I)

Structured support vector machine (also known as max margin parsing; Taskar et al., 2004):

  ∑_{i=1}^{n} [ max_{t ∈ T_{x_i}} ( w · Φ(x_i, t) + cost(t_i, t) ) (“fear”) − w · Φ(x_i, t_i) (“hope”) ]

where cost(t_i, t) is the number of local errors (either constituent errors or “rule” errors).

13 / 45
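
For illustration only, the structured hinge term for one example can be computed as below, enumerating candidate trees explicitly; features, candidates, and cost are hypothetical helpers, and in practice the max is found with cost-augmented decoding rather than enumeration.

```python
# Sketch of the cost-augmented structured hinge loss for a single example.
# `features(x, t)` returns a dict of feature counts, `candidates(x)` yields
# candidate trees, and `cost(t_gold, t)` counts local errors; all are assumed.
def dot(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def structured_hinge(w, x, t_gold, candidates, features, cost):
    hope = dot(w, features(x, t_gold))
    fear = max(dot(w, features(x, t)) + cost(t_gold, t) for t in candidates(x))
    return fear - hope    # nonnegative whenever the gold tree is among the candidates
```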

slide-14
SLIDE 14

Beyond Structured Perceptron (II)

Log-loss, which gives parsing models analogous to conditional random fields (Miyao and Tsujii, 2002; Finkel et al., 2008):

  ∑_{i=1}^{n} [ log ∑_{t ∈ T_{x_i}} exp( w · Φ(x_i, t) ) (“fear”) − w · Φ(x_i, t_i) (“hope”) ]

14 / 45
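
A comparable sketch of the log loss for one example, again enumerating candidate trees for illustration; on real sentences the log-sum-exp over all trees is computed with dynamic programming (a sum-product/inside algorithm), not by enumeration.

```python
import math

# CRF-style log loss for one example: log Z(x) minus the gold tree's score.
# `features` and `candidates` are assumed helpers, as in the previous sketch.
def log_loss(w, x, t_gold, candidates, features):
    def dot(feats):
        return sum(w.get(f, 0.0) * v for f, v in feats.items())
    scores = [dot(features(x, t)) for t in candidates(x)]
    m = max(scores)                                        # for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - dot(features(x, t_gold))                # "fear" minus "hope"
```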

slide-15
SLIDE 15

Machine Learning and Parsing

◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
◮ K-best parsing: Huang and Chiang (2005)
◮ Define rule-local features on trees (and any part of the input sentence); minimize hinge or log loss.
◮ These exploit dynamic programming algorithms for training (CKY for arbitrary scores, and the sum-product version).
◮ Learn refinements on the constituents, as latent variables (Petrov et al., 2006).

15 / 45

slide-16
SLIDE 16

Machine Learning and Parsing

◮ Define arbitrary features on trees, based on linguistic knowledge; to parse, use a PCFG to generate a k-best list of parses, then train a log-linear model to rerank (Charniak and Johnson, 2005).
◮ K-best parsing: Huang and Chiang (2005)
◮ Define rule-local features on trees (and any part of the input sentence); minimize hinge or log loss.
◮ These exploit dynamic programming algorithms for training (CKY for arbitrary scores, and the sum-product version).
◮ Learn refinements on the constituents, as latent variables (Petrov et al., 2006).
◮ Neural, too:
  ◮ Socher et al. (2013) define compositional vector grammars that associate each phrase with a vector, calculated as a function of its subphrases’ vectors. Used essentially to rerank.
  ◮ Dyer et al. (2016): recurrent neural network grammars, generative models like PCFGs that encode arbitrary previous derivation steps in a vector. Parsing requires some tricks.

16 / 45

slide-17
SLIDE 17

Dependencies

Informally, you can think of dependency structures as a transformation of phrase-structures that

◮ maintains the word-to-word relationships induced by lexicalization,
◮ adds labels to them, and
◮ eliminates the phrase categories.

There are also linguistic theories built on dependencies (Tesnière, 1959; Mel’čuk, 1987), as well as treebanks corresponding to those.

◮ Free(r)-word order languages (e.g., Czech)

17 / 45

slide-18
SLIDE 18

Dependency Tree: Definition

Let x = x_1, . . . , x_n be a sentence. Add a special root symbol as “x_0.” A dependency tree consists of a set of tuples ⟨p, c, ℓ⟩, where

◮ p ∈ {0, . . . , n} is the index of a parent
◮ c ∈ {1, . . . , n} is the index of a child
◮ ℓ ∈ L is a label

Different annotation schemes define different label sets L, and different constraints on the set of tuples. Most commonly:

◮ The tuple is represented as a directed edge from x_p to x_c with label ℓ.
◮ The directed edges form an arborescence (directed tree) with x_0 as the root.

18 / 45
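
Under this definition a tree is just a set of ⟨p, c, ℓ⟩ tuples. The sketch below (an illustration, not part of the slides) checks the two arborescence constraints: every word has exactly one parent, and every word is reachable from the root x_0.

```python
from collections import defaultdict

# Check that a set of (parent, child, label) tuples is an arborescence over
# words 1..n rooted at 0.  The tuple encoding is assumed for illustration.
def is_arborescence(arcs, n):
    parent_of, children = {}, defaultdict(list)
    for p, c, _label in arcs:
        if c in parent_of:                     # each word gets exactly one parent
            return False
        parent_of[c] = p
        children[p].append(c)
    if set(parent_of) != set(range(1, n + 1)):
        return False
    seen, stack = set(), [0]                   # every word reachable from the root
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen == set(range(1, n + 1))

# "we wash our cats": root -> wash, wash -> we, wash -> cats, cats -> our.
print(is_arborescence({(0, 2, "root"), (2, 1, "subj"), (2, 4, "obj"), (4, 3, "poss")}, 4))
```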

slide-19
SLIDE 19

Example

(S (NP (Pronoun we)) (VP (Verb wash) (NP (Determiner our) (Noun cats))))

Phrase-structure tree.

19 / 45

slide-20
SLIDE 20

Example

(S (NP (Pronoun we)) (VP (Verb wash) (NP (Determiner our) (Noun cats))))

Phrase-structure tree with heads (the original figure marks each node’s head child).

20 / 45

slide-21
SLIDE 21

Example

(S_wash (NP_we (Pronoun_we we)) (VP_wash (Verb_wash wash) (NP_cats (Determiner_our our) (Noun_cats cats))))

Phrase-structure tree with heads, lexicalized.

21 / 45

slide-22
SLIDE 22

Example

we wash our cats

“Bare bones” dependency tree: wash is the root, with dependents we and cats; our depends on cats.

22 / 45

slide-23
SLIDE 23

Example

we wash our cats who stink

(Dependency tree extending the previous example with a relative clause on cats.)

23 / 45

slide-24
SLIDE 24

Example

we vigorously wash our cats who stink

(Dependency tree, now with the adverb vigorously modifying wash.)

24 / 45

slide-25
SLIDE 25

Example

we vigorously wash our cats and dogs who stink

The bugbear of dependency syntax: coordination structures.

25 / 45

slide-26
SLIDE 26

Example

we vigorously wash our cats and dogs who stink

Make the first conjunct the head?

26 / 45

slide-27
SLIDE 27

Example

we vigorously wash our cats and dogs who stink

Make the coordinating conjunction the head?

27 / 45

slide-28
SLIDE 28

Example

we vigorously wash our cats and dogs who stink

Make the second conjunct the head?

28 / 45

slide-29
SLIDE 29

Dependency Schemes

◮ Transform the treebank: define “head rules” that can select the head child of any node in a phrase-structure tree and label the dependencies.
◮ More powerful, less local rule sets, possibly collapsing some words into arc labels.
◮ Stanford dependencies are a popular example (de Marneffe et al., 2006).
◮ Direct annotation.

29 / 45
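
As a toy illustration of the first strategy, the sketch below percolates heads up a phrase-structure tree and reads off unlabeled dependencies. The HEAD_RULES table and the nested-list tree encoding are invented for this example; real head rules (e.g., those behind the Stanford converter) are far more detailed.

```python
# Toy head-rule conversion from a phrase-structure tree to unlabeled
# dependencies.  HEAD_RULES is a stand-in for a real head-rule table.
HEAD_RULES = {"S": ["VP", "NP"], "VP": ["Verb", "VP"], "NP": ["Noun", "Pronoun", "NP"]}

def to_dependencies(tree, arcs):
    """Return the lexical head of `tree`, appending (head, dependent) arcs."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):           # preterminal: return the word
        return children[0]
    heads = [to_dependencies(child, arcs) for child in children]
    head = heads[0]                            # default: leftmost child's head
    for wanted in HEAD_RULES.get(label, []):   # first matching rule wins
        for child, h in zip(children, heads):
            if child[0] == wanted:
                head = h
                break
        else:
            continue
        break
    for h in heads:
        if h is not head:
            arcs.append((head, h))             # the head governs the other children
    return head

t = ["S", ["NP", ["Pronoun", "we"]],
          ["VP", ["Verb", "wash"], ["NP", ["Determiner", "our"], ["Noun", "cats"]]]]
arcs = []
print(to_dependencies(t, arcs), arcs)  # wash [('cats', 'our'), ('wash', 'cats'), ('wash', 'we')]
```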

slide-30
SLIDE 30

Dependencies and Grammar

Context-free grammars can be used to encode dependency structures. For every head word and constellation of dependent children:

  N_head → N_leftmost-sibling . . . N_head . . . N_rightmost-sibling

And for every head word:

  N_head → head

A bilexical dependency grammar binarizes the dependents, generating only one per rule, usually “outward” from the head. Such a grammar can produce only projective trees, which are (informally) trees in which the arcs don’t cross.

30 / 45
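
The “arcs don’t cross” condition can be checked directly. This small sketch (illustrative, not from the slides) treats each arc as an interval over word positions and looks for interleaved endpoints.

```python
# Projectivity check: two arcs cross iff their spans interleave, i.e. exactly
# one endpoint of one arc lies strictly inside the other arc's span.
def is_projective(arcs):
    spans = [tuple(sorted(arc)) for arc in arcs]
    for idx, (a, b) in enumerate(spans):
        for c, d in spans[idx + 1:]:
            if a < c < b < d or c < a < d < b:
                return False
    return True

# "A hearing is scheduled on the issue today ." (cf. the nonprojective example
# later in the deck): hearing(2) -> on(5) crosses scheduled(4) -> today(8).
print(is_projective([(3, 2), (3, 4), (2, 1), (2, 5), (5, 7), (7, 6), (4, 8)]))  # False
```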

slide-31
SLIDE 31

Quick Reminder: CKY

◮ Binary step: from L over span (i, j) and R over span (j + 1, k), build N over span (i, k), with weight p(L R | N).
◮ Preterminal step: build N over span (i, i) with weight p(x_i | N).
◮ Goal: S over span (1, n).

Each “triangle” item corresponds to a buildable phrase.

31 / 45

slide-32
SLIDE 32

CKY Example

[CKY chart for “we wash our cats”: preterminals Pronoun, Verb, Poss., Noun, then NP, VP, and S, building up to the goal item.]

32 / 45
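
To make the chart-filling concrete, here is a minimal CKY (Viterbi) sketch for a PCFG in Chomsky normal form; the grammar encoding and the toy rule probabilities are assumptions for this illustration, not the course’s grammar.

```python
import math
from collections import defaultdict

# CKY for a CNF PCFG.  `lexical[(A, word)]` and `binary[(A, B, C)]` hold rule
# probabilities; returns the best log-probability of a `start` over the input.
def cky(words, lexical, binary, start="S"):
    n = len(words)
    chart = defaultdict(lambda: -math.inf)     # chart[(i, k, A)] = best log-prob
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, A)] = math.log(p)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):          # split point
                for (A, B, C), p in binary.items():
                    s = math.log(p) + chart[(i, j, B)] + chart[(j, k, C)]
                    if s > chart[(i, k, A)]:
                        chart[(i, k, A)] = s
    return chart[(0, n, start)]

lexical = {("Pronoun", "we"): 1.0, ("NP", "we"): 0.5, ("Verb", "wash"): 1.0,
           ("Poss", "our"): 1.0, ("Noun", "cats"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "Verb", "NP"): 1.0, ("NP", "Poss", "Noun"): 1.0}
print(cky("we wash our cats".split(), lexical, binary))   # log(0.5), about -0.693
```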

slide-33
SLIDE 33

CKY for Bilexical Context-Free Grammars

◮ Attach a right dependent: from N_xh over (i, j) and N_xc over (j + 1, k), build N_xh over (i, k), with weight p(N_xh N_xc | N_xh).
◮ Attach a left dependent: from N_xc over (i, j) and N_xh over (j + 1, k), build N_xh over (i, k), with weight p(N_xc N_xh | N_xh).

Here we ignore the initial and goal rules.

33 / 45

slide-34
SLIDE 34

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Items:

[Item shapes: left- and right-pointing triangles over (h, d), and trapezoids over (h, c).]

◮ Both triangles indicate that x_d is a descendant of x_h.
◮ Both trapezoids indicate that x_c can be attached as a child of x_h.
◮ In all cases, the words “in between” are descendants of x_h.

34 / 45

slide-35
SLIDE 35

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Initialization:

For each word x_i, build a left triangle and a right triangle over (i, i), each with weight p(x_i | N_xi)^{1/2}.

Goal:

Combine a left triangle over (1, i) and a right triangle over (i, n) around a root word x_i, with weight p(N_xi | S), to reach the goal.

35 / 45

slide-36
SLIDE 36

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Attaching a left dependent:

Combine a right triangle over (i, j) with a left triangle over (j + 1, k) into a left trapezoid over (i, k), with weight p(N_xi N_xk | N_xk): x_i becomes a left dependent of x_k.

Complete a left child:

Combine a left triangle over (i, j) (headed by x_j) with a left trapezoid over (j, k) (x_j a child of x_k) into a left triangle over (i, k), headed by x_k.

36 / 45

slide-37
SLIDE 37

Dependency Parsing with the Eisner Algorithm

(Eisner, 1996)

Attaching a right dependent:

Combine a right triangle over (i, j) with a left triangle over (j + 1, k) into a right trapezoid over (i, k), with weight p(N_xi N_xk | N_xi): x_k becomes a right dependent of x_i.

Complete a right child:

Combine a right trapezoid over (i, j) (x_j a child of x_i) with a right triangle over (j, k) (headed by x_j) into a right triangle over (i, k), headed by x_i.

37 / 45

slide-38
SLIDE 38

Eisner Algorithm Example

[Eisner-algorithm chart for “we wash our cats”, building up to the goal item.]

38 / 45

slide-39
SLIDE 39

Slight Generalization

The Eisner algorithm can be used to find the projective tree with the highest score whenever the score of the dependency tree has this form:

  ∏_{⟨p,c,ℓ⟩ ∈ t} s(p, c, ℓ; x) = exp ∑_{⟨p,c,ℓ⟩ ∈ t} log s(p, c, ℓ; x)

(Recall that a tree t consists of a set of parent/child/label tuples of the form ⟨p, c, ℓ⟩; see slide 18.)

This property of a scoring function is called arc factorization; McDonald et al. (2005) called it “edge-based factorization.”

39 / 45
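
Putting the pieces together, here is a compact sketch of the arc-factored (max/Viterbi) variant of the Eisner algorithm: given arc scores s(h, m) for words 1..n with root 0, it returns the score of the best projective tree in O(n^3) time. The table layout is an implementation choice made for illustration, and backpointers (omitted for brevity) would be needed to recover the tree itself.

```python
# Sketch of the Eisner (1996) algorithm, arc-factored (max/Viterbi) variant.
# Words are 1..n, 0 is the root; score[(h, m)] is the score of arc h -> m
# (missing arcs are treated as impossible).  Returns only the best total score.
def eisner(n, score):
    NEG = float("-inf")
    def s(h, m):
        return score.get((h, m), NEG)
    # C = complete spans ("triangles"), I = incomplete spans ("trapezoids");
    # index [left][right][d], d = 0 head at left end, d = 1 head at right end.
    C = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][i] = [0.0, 0.0]
    for width in range(1, n + 1):
        for left in range(n + 1 - width):
            right = left + width
            # attach a dependent across the span (build a trapezoid)
            best = max(C[left][r][0] + C[r + 1][right][1] for r in range(left, right))
            I[left][right][0] = best + s(left, right)   # right dependent: left -> right
            I[left][right][1] = best + s(right, left)   # left dependent:  right -> left
            # absorb a finished dependent (build a bigger triangle)
            C[left][right][0] = max(I[left][r][0] + C[r][right][0]
                                    for r in range(left + 1, right + 1))
            C[left][right][1] = max(C[left][r][1] + I[r][right][1]
                                    for r in range(left, right))
    return C[0][n][0]

# Toy scores for "we wash our cats" (1=we, 2=wash, 3=our, 4=cats), root 0 -> wash.
arc_scores = {(0, 2): 0.0, (2, 1): 3.0, (2, 4): 3.0, (4, 3): 2.0}
print(eisner(4, arc_scores))   # 8.0
```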

slide-40
SLIDE 40

Remarks on Dependency Parsing

◮ Naively using CKY with the bilexical grammar will have O(n^5) runtime; Eisner gives us O(n^3).
◮ As with CKY for phrase structure, a narrow-to-wide ordering is reasonable, but an agenda may make parsing faster.
◮ As with phrase-structure parsing, you can get better accuracy with higher Markov order:
  ◮ horizontal (among siblings)
  ◮ vertical (grandparents)
◮ Transition-based approaches are popular among those who want speed.
◮ What about the projectivity assumption?
  ◮ See the reading (McDonald et al., 2005)!

40 / 45

slide-41
SLIDE 41

Nonprojective Example

A hearing is scheduled on the issue today .

[Dependency tree with arc labels ROOT, SBJ, VC, ATT, PC, TMP, PU. The tree is nonprojective: the arc attaching on the issue to hearing crosses the arc attaching today to scheduled.]

41 / 45

slide-42
SLIDE 42

Final Notes on Parsing

◮ Formalisms that are more powerful than context-free grammars include tree adjoining grammars, combinatory categorial grammars, and unification-based grammars.
  ◮ Very attractive from a linguistic point of view
  ◮ Large-scale annotation has been a challenge.
◮ What are parse trees good for?
  ◮ Syntax is a scaffold for semantics (as we’ll see next week), as well as information extraction, question answering, and sometimes machine translation.
  ◮ Features in text categorization (e.g., sentiment)

42 / 45

slide-43
SLIDE 43

Readings and Reminders

◮ McDonald et al. (2005)
◮ Assignment 4 is due March 2.
◮ Submit a suggestion for an exam question by Friday at 5pm.
◮ Your project is due March 9.

43 / 45

slide-44
SLIDE 44

References I

Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proc. of ACL, 2005.
Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP, 2002.
Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In Proc. of LREC, 2006.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars, 2016. To appear.
Jason M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proc. of COLING, 1996.
Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. Efficient, feature-based, conditional random field parsing. In Proc. of ACL, 2008.
Liang Huang and David Chiang. Better k-best parsing. In Proc. of IWPT, 2005.
Mark Johnson. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632, 1998.
Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proc. of ACL, 2003.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP, 2005. URL http://www.aclweb.org/anthology/H/H05/H05-1066.

44 / 45

slide-45
SLIDE 45

References II

Igor A. Mel’čuk. Dependency Syntax: Theory and Practice. State University of New York Press, 1987.
Yusuke Miyao and Jun’ichi Tsujii. Maximum entropy estimation for feature forests. In Proc. of HLT, 2002.
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proc. of COLING-ACL, 2006.
Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In Proc. of ACL, 2013.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, 2004.
L. Tesnière. Éléments de Syntaxe Structurale. Klincksieck, 1959.

45 / 45