
Compositionality in Semantic Vector Spaces - CS224U: Natural Language Understanding (PowerPoint presentation transcript)



  1. Compositionality in Semantic Vector Spaces. CS224U: Natural Language Understanding, Feb. 28, 2012. Richard Socher. Joint work with Chris Manning, Andrew Ng, Jeffrey Pennington, Eric Huang and Cliff Lin. More information and code at www.socher.org

  2. Word Vector Space Models. Each word is associated with an n-dimensional vector. [Figure: a 2-D vector space (axes x1, x2) with points for Germany, France, Monday and Tuesday.] Example phrases: "the country of my birth", "the place where I was born". But how can we represent the meaning of longer phrases? By mapping them into the same vector space!

  3. How should we map phrases into a vector space? Use the principle of compositionality! The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them. [Figure: the phrases "the country of my birth" and "the place where I was born" mapped into the same 2-D space as Germany, France, Monday and Tuesday; below, a tree composing the word vectors of "the country of my birth" into a phrase vector.] The algorithm jointly learns compositional vector representations (and the tree structure).

  4. Outline. Goal: algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks. [Diagram: an RNN unit with children c1 and c2, parent p, weight matrix W, and a score s.] 1. Introduction 2. Word Vectors and Recursive Neural Networks 3. Recursive Autoencoders for Sentiment Analysis 4. Paraphrase Detection

  5. Distributional Word Representations. [Figure: a 2-D vector space with points for Germany, France, Monday and Tuesday; below, the one-hot vectors for France and Monday, each with a single 1 in a different position.]

  6. Algorithms for finding word vector representations. There are many well-known algorithms that use co-occurrence statistics to compute a distributional representation for words: • Brown et al. (1992), Turney et al. (2003), and many others • LSA (Landauer & Dumais, 1997) • Latent Dirichlet Allocation (LDA; Blei et al., 2003). Recent development: neural language models. • Bengio et al. (2003) introduced a language model that predicts a word given the previous words and also learns vector representations. • Collobert & Weston (2008) and Maas et al. (2011), from the last lecture.
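As a concrete illustration of the co-occurrence route (an LSA-style recipe, not the slides' own code), here is a minimal sketch: build a word-word co-occurrence matrix from a toy corpus and take a truncated SVD to obtain dense word vectors. The corpus, the sentence-wide context window, and the dimensionality are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

# Minimal LSA-style sketch: co-occurrence counts from a tiny toy corpus,
# then a truncated SVD to get dense word vectors.
corpus = [
    "france is a country in europe",
    "germany is a country in europe",
    "monday and tuesday are days of the week",
]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for w1, w2 in combinations(words, 2):   # whole sentence as the context window
        counts[index[w1], index[w2]] += 1
        counts[index[w2], index[w1]] += 1

U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2                                        # keep the top-k latent dimensions
word_vectors = U[:, :k] * S[:k]              # dense k-dimensional word vectors

print(word_vectors[index["france"]])
print(word_vectors[index["germany"]])
```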

  7. Distributional Word Representations. Recent development: neural language models (Collobert & Weston, 2008; Turian et al., 2010).

  8. Vectorial Sentence Meaning, Step 1: Parsing. [Figure: a parse tree (S, VP, NP, AdjP nodes) over the sentence "The movie was not really exciting.", with a word vector shown at each leaf.]

  9. Vectorial Sentence Meaning, Step 2: Vectors at each node. [Figure: the same parse tree with a vector attached to every internal node (S, VP, NP, AdjP) as well as to each word of "The movie was not really exciting."]

  10. Recursive Neural Networks for Structure Prediction. Basic computational unit: the recursive neural network. Inputs: the two candidate children's representations. Outputs: 1. the semantic representation if the two nodes are merged; 2. a label that carries some information about this node. [Figure: a neural network combining the vectors for "not really" and "exciting" into a labeled parent vector.]

  11. Recursive Neural Network Definition. The parent vector is p = sigmoid(W [c1; c2] + b), where the sigmoid is applied elementwise. A softmax classifier on the parent vector gives a distribution over a set of labels. [Figure: the network combining children c1 and c2 into a labeled parent vector.]
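A minimal sketch of this single composition step, using the slide's p = sigmoid(W [c1; c2] + b) plus a softmax classifier on the parent. The dimensionality, the random weights, and the two-label set are illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n = 2                                        # word/phrase vector dimensionality
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))   # composition matrix, n x 2n
b = np.zeros(n)
W_label = rng.normal(scale=0.1, size=(2, n)) # softmax weights for 2 labels (e.g. +/-)

c1 = np.array([8.0, 3.0])                    # child vectors from the slide's example
c2 = np.array([5.0, 3.0])

p = sigmoid(W @ np.concatenate([c1, c2]) + b)   # parent (phrase) vector
label_dist = softmax(W_label @ p)               # distribution over labels

print(p, label_dist)
```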

  12. Recursive Neural Network Definition. Related work: • Previous RNN work (Goller & Küchler, 1996; Costa et al., 2003) assumed a fixed tree structure, used one-hot vectors, and had no softmax classifiers. • Jordan Pollack (1990): recursive auto-associative memories (RAAMs). • Hinton (1990) and Bottou (2011): related ideas about recursive models.

  13. Goal: Predict Pos/Neg Sentiment of the Full Sentence. [Figure: the tree over "The movie was not really exciting." with a vector at every node and a sentiment prediction of 0.3 at the root.]

  14. Predicting Sentiment with RNNs. [Figure: the word vectors for "The movie was not really exciting." with a sentiment prediction at each word (0.5, 0.5, 0.5, 0.3, 0.5, 0.7).]

  15. Predicting Sentiment with RNNs. p = sigmoid(W [c1; c2] + b). [Figure: the composition applied to two candidate word pairs, each producing a parent vector and a sentiment prediction (0.5 and 0.9).]

  16. Predicting Sentiment with RNNs. [Figure: a higher-level merge in the tree produces a parent vector with sentiment prediction 0.3.]

  17. Predicting Sentiment with RNNs. [Figure: the partially built tree over "The movie was not really exciting." with vectors at the constructed nodes.]

  18. Predicting Sentiment with RNNs. [Figure: the final composition step; the root vector of "The movie was not really exciting." yields the sentence-level sentiment prediction (0.3).]
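The bottom-up pass illustrated across slides 14-18 can be sketched as a recursion over a given binary tree: compose each pair of children with p = sigmoid(W [c1; c2] + b) and read off a sentiment score at every node. The tree shape, random word vectors, weights, and the scalar scoring vector below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compose(tree, word_vecs, W, b, w_sent):
    """tree is either a word (str) or a pair (left, right). Returns (vector, score)."""
    if isinstance(tree, str):
        v = word_vecs[tree]
    else:
        left, _ = compose(tree[0], word_vecs, W, b, w_sent)
        right, _ = compose(tree[1], word_vecs, W, b, w_sent)
        v = sigmoid(W @ np.concatenate([left, right]) + b)
    score = sigmoid(w_sent @ v)          # scalar sentiment prediction at this node
    print(tree, "->", round(float(score), 2))
    return v, score

n = 2
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(n, 2 * n))
b = np.zeros(n)
w_sent = rng.normal(scale=0.1, size=n)
word_vecs = {w: rng.normal(size=n) for w in
             "The movie was not really exciting .".split()}

# One possible binarized tree for "The movie was not really exciting."
tree = (("The", "movie"), ("was", (("not", ("really", "exciting")), ".")))
compose(tree, word_vecs, W, b, w_sent)
```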

  19. Outline. Goal: algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks. [Diagram: an RNN unit with children c1 and c2, parent p, weight matrix W, and a score s.] 1. Introduction 2. Word Vectors and Recursive Neural Networks 3. Recursive Autoencoders for Sentiment Analysis [Socher et al., EMNLP 2011] 4. Paraphrase Detection

  20. Sentiment Detection and Bag-of-Words Models. • Sentiment detection is crucial to business intelligence, stock trading, ...

  21. Sentiment Detection and Bag-of-Words Models. • Sentiment detection is crucial to business intelligence, stock trading, ... • Most methods start with a bag of words plus linguistic features/processing/lexica. • But such methods (including tf-idf) cannot distinguish: (+) "white blood cells destroying an infection" from (-) "an infection destroying white blood cells" (see the quick check below).
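A quick check of that claim in code: the two phrases have identical word counts, so any pure bag-of-words representation (tf-idf included, since it only reweights those counts) assigns them the same vector.

```python
from collections import Counter

# Identical word counts -> identical bag-of-words vectors, despite opposite meaning.
positive = "white blood cells destroying an infection"
negative = "an infection destroying white blood cells"

print(Counter(positive.split()) == Counter(negative.split()))   # True
```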

  22. Single Scale Experiments: Movies. Example reviews: "Stealing Harvard doesn't care about cleverness, wit or any other kind of intelligent humor." "A film of ideas and wry comic mayhem."

  23. Recursive Autoencoders. • Main idea: a phrase vector is good if it keeps as much information as possible about its children. [Figure: the RNN unit combining children c1 and c2 into a labeled parent vector.]

  24. Recursive Autoencoders. • Similar to the RNN, but with two differences. (1) A reconstruction error keeps as much information as possible: the parent p = sigmoid(W^(1) [c1; c2] + b) is decoded with W^(2) to reconstruct the children, while a softmax classifier W^(label) on p predicts the label (see the sketch below). [Figure: the autoencoder unit showing the reconstruction error, the softmax classifier, and the matrices W^(2), W^(label), W^(1).]
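A minimal sketch of that unit: encode the children with W^(1), decode the parent with W^(2), and measure how well the children are reconstructed. The weight shapes, the linear decoder, and the squared-error form are assumptions in the spirit of the slides, not the exact trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 2
rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.1, size=(n, 2 * n))   # encoder: children -> parent
b1 = np.zeros(n)
W2 = rng.normal(scale=0.1, size=(2 * n, n))   # decoder: parent -> reconstructed children
b2 = np.zeros(2 * n)

c1 = np.array([8.0, 3.0])
c2 = np.array([5.0, 3.0])
children = np.concatenate([c1, c2])

p = sigmoid(W1 @ children + b1)               # parent vector
reconstruction = W2 @ p + b2                  # [c1'; c2'] (linear decoder assumed here)
rec_error = 0.5 * np.sum((children - reconstruction) ** 2)
print(rec_error)
```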

  25. Recursive Autoencoders. • Reconstruction error details. [Figure: the autoencoder unit with its reconstruction-error term, softmax classifier, and matrices W^(2), W^(label), W^(1).]

  26. Recursive Autoencoders. • Reconstruction error at every node. • Important detail: normalization. [Figure: a three-word example with leaves x1, x2, x3, where p1 = f(W[x2; x3] + b) and p2 = f(W[x1; p1] + b).]
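The slide flags normalization but does not spell it out; in the published RAE (Socher et al., EMNLP 2011) the parent vector is rescaled to unit length, which stops the model from lowering the reconstruction error simply by shrinking its hidden representations. A hedged sketch of that step:

```python
import numpy as np

def normalize(p, eps=1e-8):
    # rescale the parent vector to unit length (eps guards against a zero vector)
    return p / (np.linalg.norm(p) + eps)

p = np.array([0.2, 0.1])
print(normalize(p))          # unit-length parent vector
```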

  27. Recursive Autoencoders. • Similar to the RNN, but with two differences. (2) The tree structure is determined by the reconstruction error: no parser is required, and the trees are task-dependent (a greedy sketch follows below). [Figure: the autoencoder scored on every adjacent pair of words in "The movie was not really exciting.", with a reconstruction error for each candidate pair.]
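A sketch of the greedy construction implied by slides 27-29: repeatedly merge the adjacent pair with the lowest reconstruction error until a single node spans the sentence. The random weights and word vectors are placeholders; the published system trains them jointly with the classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(c1, c2, W1, b1):
    return sigmoid(W1 @ np.concatenate([c1, c2]) + b1)

def rec_error(c1, c2, p, W2, b2):
    children = np.concatenate([c1, c2])
    return 0.5 * np.sum((children - (W2 @ p + b2)) ** 2)

def greedy_tree(vectors, words, W1, b1, W2, b2):
    nodes = list(zip(words, vectors))
    while len(nodes) > 1:
        # score every adjacent pair by its reconstruction error
        scored = []
        for i in range(len(nodes) - 1):
            p = encode(nodes[i][1], nodes[i + 1][1], W1, b1)
            e = rec_error(nodes[i][1], nodes[i + 1][1], p, W2, b2)
            scored.append((e, i, p))
        e, i, p = min(scored)                     # greedily merge the best pair
        merged = ((nodes[i][0], nodes[i + 1][0]), p)
        nodes[i:i + 2] = [merged]
    return nodes[0][0]                            # the induced binary tree

n = 2
rng = np.random.default_rng(3)
W1, b1 = rng.normal(scale=0.1, size=(n, 2 * n)), np.zeros(n)
W2, b2 = rng.normal(scale=0.1, size=(2 * n, n)), np.zeros(2 * n)
words = "The movie was not really exciting .".split()
vectors = [rng.normal(size=n) for _ in words]
print(greedy_tree(vectors, words, W1, b1, W2, b2))
```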

  28. Recursive Autoencoders. [Figure: after the first merges, new parent nodes replace the merged pairs and the remaining adjacent pairs are re-scored by reconstruction error.]

  29. Recursive Autoencoders. [Figure: further merges; the tree over "The movie was not really exciting." grows as the lowest-error pairs are combined.]

  30. Recursive Autoencoders. [Figure: the completed tree over "The movie was not really exciting." with a vector at every node.]

  31. RAE Training. • Minimize the error over each entire sentence x and its label t (plus regularization). • The error of a sentence is the sum of the errors at all nodes in its tree.
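The objective itself was in an image that did not survive the transcription; a hedged reconstruction following Socher et al. (EMNLP 2011), which these slides describe, is

\[ J = \frac{1}{N} \sum_{(x,t)} E(x, t; \theta) + \frac{\lambda}{2}\,\lVert\theta\rVert^2, \qquad E(x, t; \theta) = \sum_{s \in T(x)} E\big([c_1; c_2]_s,\; p_s,\; t;\ \theta\big), \]

where T(x) is the set of nodes in the tree built over sentence x and \lambda is the regularization weight.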

  32. RAE Training. • The error at each node is a weighted combination of the reconstruction error and the cross-entropy error (the likelihood of the label distribution) from the softmax classifier. [Figure: the autoencoder unit annotated with the reconstruction error (through W^(2)) and the cross-entropy error (through W^(label)), on top of the composition W^(1).]
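The weighting is not spelled out on the slide; in the EMNLP 2011 formulation it is a hyperparameter \alpha that trades off the two terms at every node:

\[ E\big([c_1; c_2], p, t; \theta\big) = \alpha\, E_{\mathrm{rec}}\big([c_1; c_2]; \theta\big) + (1 - \alpha)\, E_{\mathrm{cE}}\big(p, t; \theta\big). \]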

  33. Details for Training RNNs. • Minimize the error by taking gradient steps computed from matrix derivatives. • A more efficient implementation uses the backpropagation algorithm. • Since the derivatives are computed over a tree structure, this is called backpropagation through structure (Goller & Küchler, 1996).
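A compact sketch of backpropagation through structure for the p = sigmoid(W [c1; c2] + b) composition above: the error signal arriving at a node is pushed through that node's nonlinearity, contributes to the gradients of W and b, and is then split between the two children. The Node class, the gradient bookkeeping dict, the assumption that leaves are word-vector parameters, and the toy usage at the end are illustrative, not the slides' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Node:
    """Minimal tree node: leaves hold a word id, internal nodes hold two children."""
    def __init__(self, a, left=None, right=None, word=None):
        self.a, self.left, self.right, self.word = a, left, right, word

    @property
    def is_leaf(self):
        return self.left is None

def backprop_through_structure(node, delta_top, W, grads):
    """Push dE/d(node activation) down the tree, accumulating gradients.
    Assumes internal nodes were computed as p = sigmoid(W [c1; c2] + b)."""
    if node.is_leaf:
        grads["words"][node.word] += delta_top          # gradient w.r.t. the word vector
        return
    delta = delta_top * node.a * (1.0 - node.a)         # through this node's sigmoid
    children = np.concatenate([node.left.a, node.right.a])
    grads["W"] += np.outer(delta, children)
    grads["b"] += delta
    delta_children = W.T @ delta                        # split the signal between the children
    n = node.left.a.size
    backprop_through_structure(node.left,  delta_children[:n], W, grads)
    backprop_through_structure(node.right, delta_children[n:], W, grads)

# Tiny usage on a two-word tree, pretending the error gradient at the root is all ones.
n = 2
rng = np.random.default_rng(4)
W, b = rng.normal(scale=0.1, size=(n, 2 * n)), np.zeros(n)
not_, exciting = (Node(rng.normal(size=n), word=w) for w in ("not", "exciting"))
root = Node(sigmoid(W @ np.concatenate([not_.a, exciting.a]) + b), not_, exciting)
grads = {"W": np.zeros_like(W), "b": np.zeros_like(b),
         "words": {w: np.zeros(n) for w in ("not", "exciting")}}
backprop_through_structure(root, np.ones(n), W, grads)
print(grads["W"])
```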
