Natural Language Understanding Lecture 9: Dependency Parsing with Neural Networks



SLIDE 1

Natural Language Understanding

Lecture 9: Dependency Parsing with Neural Networks

Frank Keller
School of Informatics, University of Edinburgh
keller@inf.ed.ac.uk

February 13, 2017

SLIDE 2

1. Introduction

2. Transition-based Parsing with Neural Nets
   - Network Architecture
   - Embeddings
   - Training and Decoding

3. Results and Analysis
   - Results
   - Analysis

Reading: Chen and Manning (2014).

SLIDE 3

Dependency Parsing

Traditional dependency parsing (Nivre 2003):
- simple shift-reduce parser (see last lecture);
- a classifier chooses which transition (parser action) to take for each word in the input sentence;
- features for the classifier are similar to the MALT parser (last lecture): word/PoS unigrams, bigrams, trigrams; state of the parser; dependency tree built so far.

Problems:
- feature templates need to be handcrafted, resulting in millions of features;
- features are sparse and slow to extract.

SLIDE 4

Dependency Parsing

Chen and Manning (2014) propose:
- keep the simple shift-reduce parser;
- replace the classifier for transitions with a neural net;
- use dense features (embeddings) instead of sparse, handcrafted features.

Results:
- efficient parser (up to twice as fast as the standard MALT parser);
- good performance (about 2% higher attachment accuracy than MALT).

SLIDE 5

Network Architecture

Goal of the network: predict the correct transition t ∈ T, based on configuration c. Relevant information:

1. words and PoS tags (e.g., has/VBZ);
2. heads of words with their dependency labels (e.g., nsubj, dobj);
3. position of words on the stack and buffer.

[Figure: example configuration for "He has good control." with stack [ROOT, has/VBZ, good/JJ] and buffer [control/NN, ./.]; He/PRP is already attached to has as nsubj. Correct transition: SHIFT]
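To make the parser state concrete, here is a minimal sketch of an arc-standard configuration and its three transitions; the class and method names (Configuration, shift, left_arc, right_arc) are our own illustration, not code from the lecture or from Chen and Manning (2014).

```python
# Minimal sketch of an arc-standard parser configuration (illustrative only).
class Configuration:
    def __init__(self, words):
        self.stack = [0]                               # word indices; 0 = ROOT
        self.buffer = list(range(1, len(words) + 1))   # remaining input words
        self.arcs = []                                 # (head, label, dependent) triples

    def shift(self):                                   # move next buffer word onto the stack
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label):                         # top of stack heads the word below it
        dep = self.stack.pop(-2)
        self.arcs.append((self.stack[-1], label, dep))

    def right_arc(self, label):                        # word below the top heads the top
        dep = self.stack.pop()
        self.arcs.append((self.stack[-1], label, dep))

# The configuration shown on the slide, part-way through parsing:
c = Configuration(["He", "has", "good", "control", "."])
c.shift(); c.shift()        # stack: [ROOT, He, has]
c.left_arc("nsubj")         # He <- has; stack: [ROOT, has]
c.shift()                   # stack: [ROOT, has, good]; buffer: [control, .]
```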

SLIDE 6

Network Architecture

Input layer: [x^w, x^t, x^l]
Hidden layer: h = (W1^w x^w + W1^t x^t + W1^l x^l + b1)^3
Softmax layer: p = softmax(W2 h)

[Figure: the network maps words, POS tags, and arc labels extracted from the current configuration (stack and buffer) through embedding lookups into the input layer, a cube-activation hidden layer, and a softmax output over transitions.]
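As a toy illustration of this forward pass, here is a minimal numpy sketch using the lecture's layer sizes; the feature vector, weights, and the number of transitions are invented for the example, not the authors' code.

```python
# Toy numpy sketch of the forward pass on the slide.
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, n_feats, h_size, n_trans = 50, 48, 200, 3   # 48 features = 18 words + 18 tags + 12 labels;
                                               # 3 transitions stands in for the full labeled set
rng = np.random.default_rng(0)
x = rng.standard_normal(n_feats * d)           # concatenated [x^w, x^t, x^l]
W1 = 0.01 * rng.standard_normal((h_size, n_feats * d))
b1 = np.zeros(h_size)
W2 = 0.01 * rng.standard_normal((n_trans, h_size))

h = (W1 @ x + b1) ** 3                         # cube activation
p = softmax(W2 @ h)                            # probability of each transition
```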

SLIDE 7

Activation Function

[Figure: the cube, sigmoid, tanh, and identity activation functions plotted over the interval [−1, 1].]

SLIDE 8

Revision: Embeddings

[Figure: CBOW network with C one-hot context-word inputs, shared input weight matrix W (V×N), an N-dimensional hidden layer, output weight matrix W′ (N×V), and a V-dimensional softmax output.]

CBOW (Mikolov et al. 2013):
- x_ik: context words (one-hot)
- h_i: hidden units
- y_j: output units (one-hot)
- W, W′: weight matrices
- V: vocabulary size
- N: size of hidden layer
- C: number of context words

[Figure from Rong (2014).]
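For revision, a minimal numpy sketch of the CBOW forward pass (our own illustration, not from the lecture): the hidden layer h, i.e. the embedding, is the average of the input-weight rows of the context words.

```python
# Toy CBOW forward pass; sizes and word indices are invented for illustration.
import numpy as np

V, N, C = 10000, 300, 4                       # vocabulary, hidden size, context words
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((V, N))        # input weight matrix  (V x N)
W_prime = 0.01 * rng.standard_normal((N, V))  # output weight matrix (N x V)

context = [17, 256, 980, 3]                   # indices of the C context words
h = W[context].mean(axis=0)                   # hidden layer = the "embedding"

scores = h @ W_prime                          # one score per vocabulary word
y = np.exp(scores - scores.max())
y /= y.sum()                                  # softmax over the V output units
```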

SLIDE 9

Revision: Embeddings

[Same CBOW figure and notation as on the previous slide (Rong 2014).]

By embedding we mean the hidden layer h!

SLIDE 10

Embeddings

Chen and Manning (2014) use the following word embeddings S^w (18 elements):

1. top three words on the stack and buffer: s1, s2, s3, b1, b2, b3;
2. first and second leftmost/rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2;
3. leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2.

Tag embeddings S^t (18 elements): same positions as the word embeddings.
Arc label embeddings S^l (12 elements): same positions as the word embeddings, excluding those for the six words on the stack/buffer.
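A minimal sketch (our own, with hypothetical helpers lc and rc for the k-th leftmost/rightmost child) of collecting the 18 word positions S^w; the tag positions S^t are identical, and the label positions S^l drop the six stack/buffer words.

```python
NULL = -1   # padding index used when a position does not exist

def word_feature_positions(stack, buffer, lc, rc):
    """stack, buffer: lists of word indices (stack top last).
    lc(w, k) / rc(w, k): index of the k-th leftmost/rightmost child of w,
    or NULL if w is NULL or has no such child (hypothetical helpers)."""
    def s(k): return stack[-k] if len(stack) >= k else NULL
    def b(k): return buffer[k - 1] if len(buffer) >= k else NULL

    pos = [s(1), s(2), s(3), b(1), b(2), b(3)]           # 6 stack/buffer words
    for i in (1, 2):                                     # top two words on the stack
        w = s(i)
        pos += [lc(w, 1), rc(w, 1), lc(w, 2), rc(w, 2)]  # 8 direct children
        pos += [lc(lc(w, 1), 1), rc(rc(w, 1), 1)]        # 4 grandchildren
    return pos                                           # 18 positions in total
```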

SLIDE 11

Training

Generate examples {(c_i, t_i)}, i = 1, ..., m, from sentences with gold parse trees using the shortest-stack oracle (always prefers LEFT-ARC(l) over SHIFT), where c_i is a configuration and t_i ∈ T a transition.

Objective: minimize the cross-entropy loss with l2 regularization:

    L(θ) = − Σ_i log p_{t_i} + (λ/2) ||θ||²

where p_{t_i} is the probability of transition t_i (from the softmax layer), and θ is the set of all parameters {W1^w, W1^t, W1^l, b1, W2, E^w, E^t, E^l}.
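A minimal numpy sketch of this objective (our own code, simplified: the embedding matrices are folded into pre-computed feature vectors rather than treated as trainable parameters).

```python
import numpy as np

def loss(params, examples, lam=1e-8):
    """params: dict with weight arrays W1, b1, W2;
    examples: list of (x, t) pairs, x the concatenated embedding vector of a
    configuration and t the index of the oracle transition."""
    W1, b1, W2 = params["W1"], params["b1"], params["W2"]
    total = 0.0
    for x, t in examples:
        h = (W1 @ x + b1) ** 3                        # cube hidden layer
        z = W2 @ h
        log_p = z - z.max()
        log_p = log_p - np.log(np.exp(log_p).sum())   # log-softmax over transitions
        total -= log_p[t]                             # cross-entropy term: -log p_{t_i}
    l2 = sum(np.sum(w ** 2) for w in params.values())
    return total + 0.5 * lam * l2                     # + (λ/2) ||θ||²
```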

SLIDE 12

Training

- Use pre-trained word embeddings to initialize E^w; use random initialization within (−0.01, 0.01) for E^t and E^l.
- Word embeddings from Collobert et al. (2011) for English; 50-dimensional word2vec embeddings (Mikolov et al. 2013) for Chinese; compare with random initialization of E^w.
- Mini-batched AdaGrad for optimization, dropout with rate 0.5.
- Tune parameters on the development set based on UAS.
- Hyper-parameters: embedding size d = 50, hidden layer size h = 200, regularization parameter λ = 10⁻⁸, initial learning rate of AdaGrad α = 0.01.
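For reference, a standard AdaGrad update with the learning rate from the slide (a generic sketch, not the authors' implementation; the small eps term is the usual numerical-stability constant and is not mentioned on the slide).

```python
import numpy as np

alpha = 0.01                     # initial AdaGrad learning rate from the slide
eps = 1e-6                       # standard stabilizer (assumed, not from the slide)

def adagrad_step(param, grad, accum):
    """accum: running sum of squared gradients, same shape as param."""
    accum += grad ** 2
    param -= alpha * grad / (np.sqrt(accum) + eps)
    return param, accum
```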

SLIDE 13

Decoding

The parser performs greedy decoding. For each parsing step:
- extract all word, PoS, and label embeddings from the current configuration c;
- compute the hidden layer h(c);
- pick the transition with the highest score: t = argmax_t W2(t, ·) h(c);
- execute the transition: c → t(c).
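Putting the pieces together, a sketch of the greedy decoding loop (our own illustration; features, legal, and apply are hypothetical helpers built on the Configuration class and forward pass sketched earlier).

```python
import numpy as np

def greedy_parse(words, params, features, legal, apply):
    """features(c): concatenated embedding vector for configuration c;
    legal(c): boolean mask of transitions applicable in c;
    apply(c, t): execute transition t on c in place (hypothetical helpers)."""
    W1, b1, W2 = params["W1"], params["b1"], params["W2"]
    c = Configuration(words)                       # from the earlier sketch
    while c.buffer or len(c.stack) > 1:            # until only ROOT remains
        h = (W1 @ features(c) + b1) ** 3           # hidden layer h(c)
        scores = W2 @ h                            # one score per transition
        scores[~legal(c)] = -np.inf                # rule out inapplicable transitions
        t = int(np.argmax(scores))                 # greedy: highest-scoring transition
        apply(c, t)                                # c -> t(c)
    return c.arcs
```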

SLIDE 14

Results: English with CoNLL Dependencies

Parser        Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard         89.9     88.7      89.7      88.3        51
eager            90.3     89.2      89.9      88.6        63
Malt:sp          90.0     88.8      89.9      88.5       560
Malt:eager       90.1     88.9      90.1      88.7       535
MSTParser        92.1     90.8      92.0      90.5        12
Our parser       92.2     91.0      92.0      90.7      1013

SLIDE 15

Results: English with Stanford Dependencies

Parser        Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard         90.2     87.8      89.4      87.3        26
eager            89.8     87.4      89.6      87.4        34
Malt:sp          89.8     87.2      89.3      86.9       469
Malt:eager       89.6     86.9      89.4      86.8       448
MSTParser        91.4     88.1      90.7      87.6        10
Our parser       92.0     89.7      91.8      89.6       654

SLIDE 16

Results: Chinese

Parser        Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard         82.4     80.9      82.7      81.2        72
eager            81.1     79.7      80.3      78.7        80
Malt:sp          82.4     80.5      82.4      80.6       420
Malt:eager       81.2     79.3      80.2      78.4       393
MSTParser        84.0     82.1      83.0      81.2         6
Our parser       84.0     82.4      83.9      82.4       936

SLIDE 17

Effect of Activation Function

[Figure: UAS (80–90 range) on PTB:CD, PTB:SD, and CTB for the cube, tanh, sigmoid, and identity activation functions.]

SLIDE 18

Pre-trained Embeddings vs. Random Initialization

[Figure: UAS (80–90 range) on PTB:CD, PTB:SD, and CTB with pre-trained vs. randomly initialized word embeddings.]

SLIDE 19

Effect of PoS and Label Embeddings

[Figure: UAS (70–95 range) on PTB:CD, PTB:SD, and CTB for the word+POS+label, word+POS, word+label, and word-only feature sets.]

SLIDE 20

Visualization of PoS Embeddings

[Figure: 2D visualization of the learned PoS tag embeddings; tags cluster by class (nouns, verbs, adjectives, adverbs, punctuation, misc).]

SLIDE 21

Visualization of Label Embeddings

[Figure: 2D visualization of the learned arc label embeddings; labels cluster into groups such as clausal complements, noun pre-modifiers, verbal auxiliaries, subjects, preposition complements, and noun post-modifiers.]

SLIDE 22

Summary

Chen and Manning’s (2014) model:
- builds on standard transition-based dependency parsing;
- uses a neural net to select transitions;
- uses dense features (embeddings) instead of sparse, handcrafted features;
- embeddings over words, PoS tags, and arc labels;
- new cube activation function;
- good accuracy for English and Chinese dependency parsing;
- substantial improvement in speed.

SLIDE 23

References

Chen, Danqi, and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 740–750. Doha.

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12: 2493–2537.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds., Advances in Neural Information Processing Systems 26, 3111–3119. Red Hook, NY: Curran Associates.

Nivre, Joakim. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the International Workshop on Parsing Technologies, 149–160. Nancy.

Rong, Xin. 2014. word2vec parameter learning explained. Unpublished manuscript, arXiv:1411.2738.
