 
Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 18: Tree Recursive Neural Networks, Constituency Parsing, and Sentiment
Lecture Plan
1. Motivation: Compositionality and Recursion (10 mins)
2. Structure prediction with simple Tree RNN: Parsing (20 mins)
3. Backpropagation through Structure (5 mins)
4. More complex TreeRNN units (35 mins)
5. Other uses of tree-recursive neural nets (5 mins)
6. Institute for Human-Centered Artificial Intelligence (5 mins)
Last minute project tips
• Nothing works and everything is too slow → Panic
• Simplify model → Go back to basics: bag of vectors + nnet
• Make a very small network and/or dataset for debugging
• Once no bugs: increase model size
• Make sure you can overfit to your training dataset
• Plot your training and dev errors over training iterations
• Once it’s working, then regularize with L2 and Dropout
• Then if you have time, do some hyperparameter search
• Talk to us in office hours!
1. The spectrum of language in CS
Semantic interpretation of language – Not just word vectors
How can we work out the meaning of larger phrases?
• The snowboarder is leaping over a mogul
• A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Compositionality
Language understanding – & Artificial Intelligence – requires being able to understand bigger things from knowing about smaller parts
Are languages recursive?
• Cognitively somewhat debatable (need to head to infinity)
• But: recursion is natural for describing language
• [The person standing next to [the man from [the company that purchased [the firm that you used to work at]]]]
• noun phrase containing a noun phrase containing a noun phrase
• It’s a very powerful prior for language structure
Penn Treebank tree
2. Building on Word Vector Space Models
[Figure: 2-D word vector space with axes x1 and x2, showing the words Germany, France, Monday, and Tuesday and the phrases "the country of my birth" and "the place where I was born"]
How can we represent the meaning of longer phrases?
By mapping them into the same vector space!
How should we map phrases into a vector space?
Socher, Manning, and Ng. ICML, 2011
Use principle of compositionality
The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.
Models in this section can jointly learn parse trees and compositional vector representations
[Figure: the x1/x2 vector space with Germany, France, Monday, Tuesday, and the two phrases, next to a tree that composes vectors over "the country of my birth"]
Constituency Sentence Parsing: What we want
[Figure: constituency tree with S, VP, PP, and NP nodes over the word vectors of "The cat sat on the mat."]
Learn Structure and Representation
[Figure: the same tree over "The cat sat on the mat.", now with a vector at each of the S, VP, PP, and NP nodes]
Recursive vs. recurrent neural networks
[Figure: a tree-structured network and a left-to-right chain network, each composing vectors over "the country of my birth"]
• Recursive neural nets require a tree structure
• Recurrent neural nets cannot capture phrases without prefix context and often capture too much of the last words in the final vector
Recursive Neural Networks for Structure Prediction
Inputs: two candidate children’s representations
Outputs:
1. The semantic representation if the two nodes are merged.
2. Score of how plausible the new node would be.
[Figure: a neural network merging two child vectors from "on the mat." into a parent vector with score 1.3]
Recursive Neural Network Definition
$$\text{score} = U^T p, \qquad p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$
Same W parameters at all nodes of the tree
[Figure: a neural network combining children c1 and c2 into a parent vector, with score = 1.3]
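The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the lecture's own code: the function name `compose`, the toy dimension n = 2, and every parameter value below are assumptions made up for the example.

```python
import math

def compose(c1, c2, W, b, U):
    """Merge two child vectors; return (parent vector, plausibility score)."""
    c = c1 + c2                                   # list concatenation = [c1; c2]
    # p = tanh(W [c1; c2] + b): the parent lives in the same n-dim space
    p = [math.tanh(sum(w * x for w, x in zip(row, c)) + bi)
         for row, bi in zip(W, b)]
    score = sum(u * pi for u, pi in zip(U, p))    # score = U^T p
    return p, score

# Toy parameters (assumed values): W is n x 2n, b and U are n-dimensional.
W = [[0.1, -0.2, 0.05, 0.1],
     [0.0, 0.1, -0.1, 0.2]]
b = [0.0, 0.0]
U = [1.0, -0.5]

c1 = [0.8, 0.5]
c2 = [0.3, 0.3]
p, score = compose(c1, c2, W, b, U)
print(p, score)
```

Because the parent has the same dimension as each child, the same `compose` call can be applied again to the parent and one of its neighbours, which is what allows the same W to be used at all nodes of the tree.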
Parsing a sentence with an RNN (greedily)
[Figure, four steps over "The cat sat on the mat.": (1) the network scores every adjacent pair of word vectors; (2) the best-scoring pair is merged into a new node and the remaining adjacent pairs are scored; (3) a further merge is made and the candidates are rescored; (4) the completed tree]
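The greedy procedure in these slides can be sketched as follows. This is an illustrative sketch, not the lecture's implementation: `compose` and `greedy_parse` are hypothetical names, and the word vectors and parameters are made-up toy values with no trained meaning.

```python
import math

def compose(c1, c2, W, b, U):
    """p = tanh(W [c1; c2] + b), score = U^T p."""
    c = c1 + c2
    p = [math.tanh(sum(w * x for w, x in zip(row, c)) + bi)
         for row, bi in zip(W, b)]
    return p, sum(u * pi for u, pi in zip(U, p))

def greedy_parse(words, vectors, W, b, U):
    """Repeatedly merge the highest-scoring adjacent pair until one node is left."""
    nodes = list(zip(words, vectors))             # (subtree, vector) pairs
    total = 0.0                                   # sum of the chosen merge scores
    while len(nodes) > 1:
        # Score every adjacent pair of current nodes.
        candidates = [compose(nodes[i][1], nodes[i + 1][1], W, b, U)
                      for i in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        p, score = candidates[best]
        merged = ((nodes[best][0], nodes[best + 1][0]), p)
        nodes[best:best + 2] = [merged]           # replace the pair by its parent
        total += score
    return nodes[0][0], total

# Toy inputs (assumed values).
words = ["The", "cat", "sat", "on", "the", "mat."]
vectors = [[0.9, 0.1], [0.5, 0.3], [0.7, 0.1], [0.8, 0.5], [0.9, 0.1], [0.4, 0.3]]
W = [[0.3, -0.2, 0.1, 0.4],
     [-0.1, 0.5, 0.2, -0.3]]
b = [0.0, 0.0]
U = [0.7, -0.4]

tree, total = greedy_parse(words, vectors, W, b, U)
print(tree, total)
```

The result is a binary bracketing expressed as nested tuples. Every step rescores all adjacent pairs for clarity; only the pairs that touch the newly merged node actually change.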
Max-Margin Framework - Details
• The score of a tree is computed by the sum of the parsing decision scores at each node:
$$s(x, y) = \sum_{n \in \mathrm{nodes}(y)} s_n$$
• x is sentence; y is parse tree
Max-Margin Framework - Details
• Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective
• The loss penalizes all incorrect decisions
• Structure search for A(x) was greedy (join best nodes each time)
• Instead: Beam search with chart
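For reference, a supervised max-margin objective of this kind is commonly written as below. This is a sketch that assumes the formulation of Socher et al. (ICML 2011): s(x, y) is the tree score (the sum of node scores), A(x_i) is the set of candidate trees for sentence x_i, and the symbols J and Δ are assumptions of this note, with Δ(y, y_i) a margin that grows with the number of incorrect decisions in y relative to the gold tree y_i.

```latex
J = \sum_i \Big[ s(x_i, y_i) \;-\; \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \Big]
```

Maximizing J pushes the score of the correct tree above the score of every other candidate by at least that candidate's margin, which is how the loss penalizes all incorrect decisions.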
Scene Parsing
Similar principle of compositionality.
• The meaning of a scene image is also a function of smaller regions,
• how they combine as parts to form larger objects,
• and how the objects interact.
Algorithm for Parsing Images
Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)
[Figure: Parsing Natural Scene Images. Layers labeled Segments, Features, and Semantic Representations, with region labels People, Building, Grass, and Tree]
Multi-class segmentation
Method                                              Accuracy
Pixel CRF (Gould et al., ICCV 2009)                 74.3
Classifier on superpixel features                   75.9
Region-based energy (Gould et al., ICCV 2009)       76.4
Local labelling (Tighe & Lazebnik, ECCV 2010)       76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010)        77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010)      77.5
Recursive Neural Network                            78.1
Stanford Background Dataset (Gould et al. 2009)
3. Backpropagation Through Structure
Introduced by Goller & Küchler (1996)
Principally the same as general backpropagation
$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)}), \qquad \frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}$$
Calculations resulting from the recursion and tree structure:
1. Sum derivatives of W from all nodes (like RNN)
2. Split derivatives at each node (for tree)
3. Add error messages from parent + node itself
BTS: 1) Sum derivatives of all nodes
You can actually assume it’s a different W at each node
Intuition via example: if we take separate derivatives of each occurrence of W and sum them, we get the same result as differentiating with respect to the shared W
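The claim can be checked numerically on a scalar toy case. The sketch below is an assumption-laden illustration, not the slide's own example: it uses tanh as the nonlinearity, a hypothetical two-level chain `f`, and central finite differences instead of analytic derivatives.

```python
import math

def f(w1, w2, x):
    """Two-level chain that applies a weight twice: tanh(w2 * tanh(w1 * x))."""
    return math.tanh(w2 * math.tanh(w1 * x))

def derivative(g, v, eps=1e-6):
    """Central finite-difference estimate of dg/dv."""
    return (g(v + eps) - g(v - eps)) / (2 * eps)

w, x = 0.7, 1.3   # arbitrary toy values

# Derivative when both occurrences are tied to one shared weight.
shared = derivative(lambda v: f(v, v, x), w)

# Derivatives when each occurrence is treated as its own weight.
lower = derivative(lambda v: f(v, w, x), w)   # only the inner occurrence varies
upper = derivative(lambda v: f(w, v, x), w)   # only the outer occurrence varies

print(shared, lower + upper)
```

The two printed values agree up to finite-difference error, which is why the W gradients computed at the individual nodes can simply be summed.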
BTS: 2) Split derivatives at each node
During forward prop, the parent is computed using 2 children
$$p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$
Hence, the errors need to be computed wrt each of them, where each child’s error is n-dimensional
BTS: 3) Add error messages
• At each node:
  • What came up (fprop) must come down (bprop)
  • Total error messages = error messages from parent + error message from own score
BTS Python Code: forwardProp
BTS Python Code: backProp
$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)}), \qquad \frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}$$
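The following is a minimal sketch of forward and backward propagation through a small tree, written from the three BTS steps rather than taken from the lecture's code. The names `forward`, `backward`, and `total_score`, the nested-tuple tree encoding, and all numeric values are assumptions. The quantity differentiated is the total tree score, and the regularization term λW is left out.

```python
import math

def forward(tree, W, b, U):
    """Leaves are lists of floats; internal nodes are (left, right) tuples."""
    if isinstance(tree, list):
        return {"p": tree, "kids": None, "score": 0.0}
    left, right = (forward(t, W, b, U) for t in tree)
    c = left["p"] + right["p"]                        # [c1; c2]
    p = [math.tanh(sum(w * x for w, x in zip(row, c)) + bi)
         for row, bi in zip(W, b)]
    score = sum(u * pi for u, pi in zip(U, p))        # node score U^T p
    return {"p": p, "kids": (left, right), "c": c, "score": score}

def total_score(node):
    if node["kids"] is None:
        return 0.0
    return node["score"] + sum(total_score(k) for k in node["kids"])

def backward(node, err, W, U, gW):
    """Accumulate d(total score)/dW into gW; err is the parent's message."""
    if node["kids"] is None:
        return                                        # err would go to the word vector
    n = len(node["p"])
    # 3) add the parent's message to the error from this node's own score
    total = [e + u for e, u in zip(err, U)]
    # through tanh: delta = total * (1 - p^2), elementwise
    delta = [t * (1.0 - p * p) for t, p in zip(total, node["p"])]
    # 1) sum the W gradient over all nodes: gW += delta c^T
    for i in range(n):
        for j in range(2 * n):
            gW[i][j] += delta[i] * node["c"][j]
    # 2) split W^T delta into one n-dimensional message per child
    down = [sum(W[i][j] * delta[i] for i in range(n)) for j in range(2 * n)]
    backward(node["kids"][0], down[:n], W, U, gW)
    backward(node["kids"][1], down[n:], W, U, gW)

# Toy setup (assumed values): n = 2, tree ((w1 w2) w3).
n = 2
W = [[0.1, -0.2, 0.3, 0.05], [0.2, 0.1, -0.1, 0.15]]
b = [0.0, 0.0]
U = [0.5, -0.3]
tree = (([0.9, 0.5], [0.7, 0.8]), [0.4, 0.1])

root = forward(tree, W, b, U)
gW = [[0.0] * (2 * n) for _ in range(n)]
backward(root, [0.0] * n, W, U, gW)

# Finite-difference check on one entry of W.
eps = 1e-6
W[0][1] += eps
plus = total_score(forward(tree, W, b, U))
W[0][1] -= 2 * eps
minus = total_score(forward(tree, W, b, U))
W[0][1] += eps
numeric = (plus - minus) / (2 * eps)
print(gW[0][1], numeric)
```

The recursion in `backward` mirrors the one in `forward`: what came up as `p` comes back down as `down`, and the finite-difference check at the end is a cheap way to confirm that the summed gradient is right.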
Discussion: Simple TreeRNN
• Decent results with single matrix TreeRNN
• Single weight matrix TreeRNN could capture some phenomena but is not adequate for more complex, higher order composition and parsing long sentences
• There is no real interaction between the input words
• The composition function is the same for all syntactic categories, punctuation, etc.
[Figure: children c1 and c2 combined through W into parent p, which W_score maps to the score s]