Natural Language Processing with Deep Learning CS224N/Ling284
Christopher Manning
Lecture 18: Tree Recursive Neural Networks, Constituency Parsing, and Sentiment
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Language understanding – and Artificial Intelligence – requires being able to understand bigger things from knowing about smaller parts
Phrase-structure recursion: [The person standing next to [the man from [the company that purchased [the firm that you used to work at]]]]
[Figure: a Penn Treebank constituency tree]
How can we represent the meaning of longer phrases?
By mapping them into the same vector space!
[Figure: a 2D vector space (x1, x2) in which words – Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3) – and phrases – "the country of my birth" (1, 5), "the place where I was born" (1.1, 4) – appear as nearby points]
[Figure: the phrase "the country of my birth" with a vector at each word and at each composed node]
Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
Models in this section can jointly learn parse trees and compositional vector representations
Socher, Manning, and Ng. ICML, 2011
[Figure: a parse tree for "The cat sat on the mat." with syntactic categories (NP, VP, PP, S) and a vector at every word and phrase node]
Recursive vs. recurrent neural networks:
Recursive neural networks require a tree structure.
Recurrent neural networks cannot capture phrases without prefix context, and often capture too much of the last words in the final vector.
[Figure: "the country of my birth" composed bottom-up over a tree (recursive) vs. strictly left-to-right (recurrent), with a vector at each node]
Recursive Neural Networks for Structure Prediction
[Figure: a neural network merges two children's vectors into a candidate parent vector plus a plausibility score, e.g. 1.3]
Inputs: the two candidate children's representations. Outputs: (1) the semantic representation if the two nodes are merged, and (2) a score of how plausible the new node would be.
$\text{score} = U^T p, \qquad p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$
Same W parameters at all nodes
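To make this concrete, here is a minimal numpy sketch of the composition-and-scoring network above; the dimensions, initialization scheme, and the helper name compose are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                  # dimensionality of word/phrase vectors (toy value)
W = rng.normal(size=(n, 2 * n)) * 0.1  # single composition matrix, tied across all nodes
b = np.zeros(n)                        # bias
U = rng.normal(size=n) * 0.1           # scoring vector

def compose(c1, c2):
    """Merge two children into a parent vector and a plausibility score."""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)  # p = tanh(W [c1; c2] + b)
    return p, U @ p                                 # score = U^T p

c1, c2 = rng.normal(size=n), rng.normal(size=n)    # two (random) word vectors
parent, score = compose(c1, c2)
```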
[Figures: greedy parsing of "The cat sat on the mat." – at each step the network scores every pair of adjacent candidate constituents, the highest-scoring pair (e.g. score 3.1) is merged into a new parent node, and the process repeats until the full tree (with NP, VP, PP, S nodes) is built]
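A greedy parser then falls out of the scorer almost directly. Here is a sketch building on the compose function above; the helper name greedy_parse and the nested-tuple tree encoding are illustrative:

```python
import numpy as np
# uses compose() and n from the sketch above

def greedy_parse(word_vectors):
    """Repeatedly merge the highest-scoring adjacent pair until one node remains."""
    # Each entry is (vector, subtree); leaves carry their position as the subtree.
    nodes = [(v, i) for i, v in enumerate(word_vectors)]
    total_score = 0.0
    while len(nodes) > 1:
        # Score every adjacent candidate pair with the composition network
        cands = [compose(nodes[i][0], nodes[i + 1][0])
                 for i in range(len(nodes) - 1)]
        best = max(range(len(cands)), key=lambda i: cands[i][1])
        p, s = cands[best]
        total_score += s
        # Replace the winning pair with its new parent node
        nodes[best:best + 2] = [(p, (nodes[best][1], nodes[best + 1][1]))]
    return nodes[0][1], total_score  # (tree, sum of decision scores)

words = [np.random.default_rng(i).normal(size=n) for i in range(6)]
tree, s = greedy_parse(words)        # tree is a nested tuple of leaf indices
```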
The score of a tree is computed as the sum of the parsing decision scores at each node:
$s(x, y) = \sum_{n \in \text{nodes}(y)} s_n$
Training uses a max-margin objective: the correct tree should score higher than any other tree, by a margin.
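Concretely, the structured max-margin objective in this line of work (Socher et al. 2011, following max-margin parsing of Taskar et al. 2004) has the standard form below, where $A(x_i)$ is the set of candidate trees for sentence $x_i$ and $\Delta(y, y_i)$ is a structured margin loss, e.g. counting incorrect spans:

$J = \sum_i \left[ s(x_i, y_i) - \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \right]$

Maximizing $J$ pushes the gold tree's score above every competitor's by a margin that grows with how wrong the competitor is.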
Scene parsing: the meaning of a scene image is also a function of smaller regions and of how they combine as parts to form larger objects. Similar principle of compositionality.
Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)
Parsing Natural Scene Images
[Figure: image segments (grass, tree, people, building) are mapped to features and semantic representations, then composed into a scene parse tree]
Method | Accuracy
Pixel CRF (Gould et al., ICCV 2009) | 74.3
Classifier on superpixel features | 75.9
Region-based energy (Gould et al., ICCV 2009) | 76.4
Local labelling (Tighe & Lazebnik, ECCV 2010) | 76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010) | 77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010) | 77.5
Recursive Neural Network | 78.1
Stanford Background Dataset (Gould et al. 2009)
Backpropagation through structure: introduced by Goller & Küchler (1996) – old stuff! Principally the same as general backpropagation, with a few extra calculations resulting from the recursion and tree structure:
$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)}), \qquad \frac{\partial E_R}{\partial W^{(l)}} = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}$
Sum derivatives of W from all nodes: you can actually assume it's a different W at each node. Intuition via example: if we take separate derivatives of each occurrence and sum them, we get the same result as taking the derivative of the tied W.
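A quick scalar sanity check of that intuition, using a toy function in which the same weight w appears twice (an illustration, not the actual tree computation):

```python
import numpy as np

# f uses the tied weight w twice, like a tied W used at two tree nodes
x, w = 0.7, 1.3
f = lambda w_: w_ * np.tanh(w_ * x)

# Separate derivative of each occurrence, holding the other occurrence fixed
d_outer = np.tanh(w * x)                         # outer w varies, inner fixed
d_inner = w * x * (1.0 - np.tanh(w * x) ** 2)    # inner w varies, outer fixed
summed = d_outer + d_inner

# Finite-difference derivative of the tied function agrees with the sum
eps = 1e-6
numeric = (f(w + eps) - f(w - eps)) / (2 * eps)
print(summed, numeric)   # both print (approximately) the same value
```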
Split derivatives at each node: during forward prop, the parent is computed using its 2 children; hence, the errors need to be computed with respect to each of them, where each child's error is n-dimensional:
$p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$
[Figure: the parent's error message is split and passed down to children c1 and c2]
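A minimal numpy sketch of this downward split at a single node, using the same conventions as the compose sketch above (all names and dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
W = rng.normal(size=(n, 2 * n)) * 0.1
b = np.zeros(n)
c1, c2 = rng.normal(size=n), rng.normal(size=n)

# Forward: parent from two children
z = W @ np.concatenate([c1, c2]) + b
p = np.tanh(z)

# Backward: delta_p is the error message arriving at the parent vector
delta_p = rng.normal(size=n)
delta_z = delta_p * (1.0 - p ** 2)        # back through tanh: f'(z) = 1 - p^2
down = W.T @ delta_z                      # one 2n-dimensional message
delta_c1, delta_c2 = down[:n], down[n:]   # split into an n-dim error per child
```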
Add error messages: at each node, the total error is the error message passed down from the parent plus the message from the node's own score.
[Figure: a node receives error messages from its parent and from its own score]
Discussion: a single weight matrix RNN can capture some phenomena, but it is not adequate for more complex, higher-order composition or for parsing long sentences. The composition function is also the same for all syntactic categories, punctuation, etc.
[Figure: simple RNN – one matrix W maps children c1 and c2 to parent p; Wscore maps p to the score s]
Version 2: Syntactically-Untied RNN. A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure. Use the discrete syntactic categories of the children to choose the composition matrix: a different composition matrix for different syntactic environments gives better results, as sketched below.
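In code, syntactic untying is essentially indexing the composition matrix by the children's categories; a minimal sketch (the category pairs and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# One composition matrix per (left-category, right-category) pair
W = {pair: rng.normal(size=(n, 2 * n)) * 0.1
     for pair in [("DT", "NP"), ("NP", "PP"), ("VP", "NP")]}
b = np.zeros(n)

def compose_su(c1, cat1, c2, cat2):
    """Choose the composition matrix by the children's syntactic categories."""
    return np.tanh(W[(cat1, cat2)] @ np.concatenate([c1, c2]) + b)

p = compose_su(rng.normal(size=n), "DT", rng.normal(size=n), "NP")
```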
[Socher, Bauer, Manning, Ng 2013]
Problem: speed – every candidate score in beam search needs a matrix-vector product. Solution: compute scores only for a subset of trees coming from a simpler, faster model (a PCFG). The PCFG prunes very unlikely candidates and provides the coarse syntactic categories of the children for each beam candidate.
Related work: the resulting parser is related to previous work that extends PCFG parsers. Some approaches learn to split and merge syntactic categories; lexicalized parsers describe each category with a lexical item; Hall and Klein (2012) combine several such annotation schemes in a factored parser. CVGs extend these ideas from discrete representations to richer continuous ones.
Parser | Test, All Sentences (F1)
Stanford PCFG (Klein and Manning, 2003a) | 85.5
Stanford Factored (Klein and Manning, 2003b) | 86.6
Factored PCFGs (Hall and Klein, 2012) | 89.4
Collins (Collins, 1997) | 87.7
SSN (Henderson, 2004) | 89.4
Berkeley Parser (Petrov and Klein, 2007) | 90.1
CVG (RNN) (Socher et al., ACL 2013) | 85.0
CVG (SU-RNN) (Socher et al., ACL 2013) | 90.4
Charniak, Self-Trained (McClosky et al. 2006) | 91.0
Charniak, Self-Trained and ReRanked (McClosky et al. 2006) | 92.1
[Figure: learned SU-RNN composition matrices for the category pairs NP-CC, NP-PP, PP-NP, PRP$-NP]
[Figure: learned SU-RNN composition matrices for the category pairs ADJP-NP, ADVP-ADJP, JJ-NP, DT-NP]
Analysis of resulting representations – example sentences:
All the figures are adjusted for seasonal variations
Knight-Ridder wouldn't comment on the offer
Sales grew almost 7% to $UNK m. from $UNK m.
Version 3: Compositionality Through Recursive Matrix-Vector Spaces
One way to make the composition function more powerful was by untying the weights W. But what if words act mostly as an operator, e.g. "very" in "very good"? Proposal: a new composition function.
Before: $p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$
[Socher, Huval, Bhat, Manning, & Ng, 2012]
Compositionality Through Matrix-Vector Recursive Neural Networks
Before: $p = \tanh\!\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right)$
Now: $p = \tanh\!\left(W \begin{bmatrix} C_2 c_1 \\ C_1 c_2 \end{bmatrix} + b\right)$, where every word and phrase has both a vector $c_i$ and an operator matrix $C_i$.
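A minimal sketch of one MV-RNN composition step; the parameter names (W, WM) and the near-identity initialization of the word matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W  = rng.normal(size=(n, 2 * n)) * 0.1  # combines the two transformed vectors
WM = rng.normal(size=(n, 2 * n)) * 0.1  # combines the two matrices
b  = np.zeros(n)

def mv_compose(c1, C1, c2, C2):
    """MV-RNN: every node has a vector c and an operator matrix C."""
    # Each child's vector is first transformed by its sibling's matrix
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    # The parent also gets a matrix, built from the children's matrices
    P = WM @ np.vstack([C1, C2])          # (n x 2n) @ (2n x n) -> (n x n)
    return p, P

c1 = rng.normal(size=n); C1 = np.eye(n) + rng.normal(size=(n, n)) * 0.01
c2 = rng.normal(size=n); C2 = np.eye(n) + rng.normal(size=(n, n)) * 0.01
p, P = mv_compose(c1, C1, c2, C2)
```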
Matrix-vector RNNs
[Socher, Huval, Bhat, Manning, & Ng, 2012]
A good example of non-linearity in language
Classification of semantic relationships: can an MV-RNN learn how a large syntactic context conveys a semantic relationship? Example: "My [apartment]e1 has a pretty large [kitchen]e2" → component-whole relationship (e2, e1). Build a single compositional semantics for the minimal constituent including both terms.
Classifier | Features | F1
SVM | POS, stemming, syntactic patterns | 60.1
MaxEnt | POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams | 77.6
SVM | POS, WordNet, prefixes, morphological features, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner | 82.2
RNN | – | 74.8
MV-RNN | – | 79.1
MV-RNN | POS, WordNet, NER | 82.4
Socher, Perelygin, Wu, Chuang, Manning, Ng, and Potts 2013
Sentiment detection: is the tone of a piece of text positive, negative, or neutral?
[Example: to a bag-of-words model, a review is just scattered sentiment words – "… loved … great … impressed … marvelous …" – with the word order lost]
http://nlp.stanford.edu:8080/sentiment/
[Chart: accuracy (75–84%) of BiNB, RNN, and MV-RNN, training with sentence labels vs. training with the Treebank]
Version 4: Recursive Neural Tensor Network. Idea: allow both additive and mediated multiplicative interactions of vectors.
Use the resulting vectors in the tree as input to a classifier like logistic regression.
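The RNTN adds a bilinear tensor term to the usual affine composition; a minimal sketch, with a softmax head standing in for the per-node sentiment classifier (names and dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 4, 5
V  = rng.normal(size=(n, 2 * n, 2 * n)) * 0.01  # one 2n x 2n slice per output dim
W  = rng.normal(size=(n, 2 * n)) * 0.1
b  = np.zeros(n)
Ws = rng.normal(size=(n_classes, n)) * 0.1      # softmax sentiment classifier

def rntn_compose(c1, c2):
    """Additive (W) plus mediated multiplicative (V) interactions."""
    x = np.concatenate([c1, c2])
    bilinear = np.einsum("i,kij,j->k", x, V, x)  # x^T V[k] x for each slice k
    return np.tanh(bilinear + W @ x + b)

p = rntn_compose(rng.normal(size=n), rng.normal(size=n))
logits = Ws @ p
probs = np.exp(logits - logits.max()); probs /= probs.sum()  # sentiment distribution
```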
[Chart: accuracy (74–86%) of BiNB, RNN, MV-RNN, and RNTN, training with sentence labels vs. training with the Treebank; RNTN does best]
Classifying sentences: accuracy improves to 85.4%, compared with biword NB (58%) and RNN (54%).
When negating negatives, positive activation should increase!
Demo: http://nlp.stanford.edu:8080/sentiment/
Version 5: Tree LSTMs [Tai et al., ACL 2015; also Zhu et al., ICML 2015]
Goals: represent sentence meaning in a (high-dimensional, continuous) vector space.
Gates are vectors in $[0,1]^d$, multiplied element-wise for soft masking.
Generalizes sequential LSTM to trees with any branching factor
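For concreteness, a sketch of one Child-Sum Tree-LSTM node in the style of Tai et al. (2015); the toy dimensions and parameter layout are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# (W, U, b) for each gate: input i, forget f, output o, and the update u
P = {g: (rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1, np.zeros(d))
     for g in "ifou"}

def tree_lstm(x, children):
    """One Child-Sum Tree-LSTM node; children is a list of (h, c) pairs."""
    h_sum = sum(h for h, _ in children) if children else np.zeros(d)
    Wi, Ui, bi = P["i"]; Wo, Uo, bo = P["o"]
    Wu, Uu, bu = P["u"]; Wf, Uf, bf = P["f"]
    i = sigmoid(Wi @ x + Ui @ h_sum + bi)
    o = sigmoid(Wo @ x + Uo @ h_sum + bo)
    u = np.tanh(Wu @ x + Uu @ h_sum + bu)
    # One forget gate per child, conditioned on that child's own hidden state
    c = i * u + sum(sigmoid(Wf @ x + Uf @ h_k + bf) * c_k for h_k, c_k in children)
    h = o * np.tanh(c)
    return h, c

leaf1 = tree_lstm(rng.normal(size=d), [])             # leaves have no children
leaf2 = tree_lstm(rng.normal(size=d), [])
root = tree_lstm(rng.normal(size=d), [leaf1, leaf2])  # any branching factor works
```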
Gilles Louppe, Kyunghyun Cho, Cyril Becot, Kyle Cranmer (2017), QCD-Aware Recursive Neural Networks for Jet Physics
Tree-to-tree Neural Networks for Program Translation
[Chen, Liu, and Song NeurIPS 2018]
Translation between programming languages.