

SLIDE 1

Compositionality in Semantic Vector Spaces

CS224U: Natural Language Understanding
Feb. 28, 2012

Richard Socher

Joint work with Chris Manning, Andrew Ng, Jeffrey Pennington, Eric Huang, and Cliff Lin.
More information and code at www.socher.org

SLIDE 2

Word Vector Space Models

Each word is associated with an n-dimensional vector.

[Figure: words as points in a 2D vector space (axes x1, x2): Monday = (9, 2), Tuesday = (9.5, 1.5), France = (2, 2.5), Germany = (1, 3).]

But how can we represent the meaning of longer phrases?

By mapping them into the same vector space! In the figure, "the country of my birth" and "the place where I was born" land near each other, at roughly (1, 5) and (1.1, 4).
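As a minimal sketch of the idea, using the illustrative 2D coordinates from the figure (real models use n-dimensional learned vectors), nearby points correspond to similar words:

```python
import numpy as np

# Illustrative 2D word vectors taken from the figure above.
vectors = {
    "Monday":  np.array([9.0, 2.0]),
    "Tuesday": np.array([9.5, 1.5]),
    "France":  np.array([2.0, 2.5]),
    "Germany": np.array([1.0, 3.0]),
}

def euclidean(u, v):
    return np.linalg.norm(u - v)

# Similar words are close: Monday is much nearer to Tuesday than to France.
print(euclidean(vectors["Monday"], vectors["Tuesday"]))  # ~0.71
print(euclidean(vectors["Monday"], vectors["France"]))   # ~7.02
```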

SLIDE 3

How should we map phrases into a vector space?

[Figure: the phrase "the country of my birth" parsed into a tree, with a vector at each word and at each internal node; the resulting phrase vectors are plotted in the same 2D space as "the place where I was born", Monday, Tuesday, France, and Germany.]

Use the principle of compositionality! The meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.

The algorithm jointly learns compositional vector representations (and the tree structure).

SLIDE 4

Outline

Goal: Algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks.

  1. Introduction
  2. Word Vectors and Recursive Neural Networks
  3. Recursive Autoencoders for Sentiment Analysis
  4. Paraphrase Detection

[Figure: the recurring RNN unit: weights W compose children c1, c2 into parent p; Wscore maps p to a score s.]

SLIDE 5

Distributional Word Representations

[Figure: the same 2D vector space, now also showing In = (8, 5) alongside Monday = (9, 2), Tuesday = (9.5, 1.5), France = (2, 2.5), and Germany = (1, 3).]

SLIDE 6

Algorithms for finding word vector representations

There are many well-known algorithms that use co-occurrence statistics to compute a distributional representation for words:

  • (Brown et al., 1992; Turney et al., 2003; and many others)
  • LSA (Landauer & Dumais, 1997)
  • Latent Dirichlet Allocation (LDA; Blei et al., 2003)

Recent development: "Neural language models."

  • Bengio et al. (2003) introduced a language model that predicts words given previous words and also learns vector representations.
  • Collobert & Weston (2008) and Maas et al. (2011), from the last lecture.

SLIDE 7

Distributional Word Representations

Recent development: "Neural language models" (Collobert & Weston, 2008; Turian et al., 2010)

[Figure: visualization of learned neural word embeddings.]

SLIDE 8

Vectorial Sentence Meaning - Step 1: Parsing

[Figure: "The movie was not really exciting." with its parse tree (S, NP, VP, AdjP) and a 2D vector beneath each word.]

SLIDE 9

Vectorial Sentence Meaning - Step 2: Vectors at each node

[Figure: the same parse tree with a 2D vector computed at every internal node (NP, AdjP, VP, S) as well as at each word.]

SLIDE 10

Recursive Neural Networks for Structure Prediction

Basic computational unit: the Recursive Neural Network.

Inputs: two candidate children's representations.
Outputs:
  • 1. The semantic representation if the two nodes are merged.
  • 2. A label that carries some information about this node.

[Figure: a neural network merging the child vectors for "not" and "really exciting" into a parent vector plus a label.]

SLIDE 11

Recursive Neural Network Definition

p = sigmoid(W [c1; c2] + b), where sigmoid(x) = 1 / (1 + e^(-x)).

A softmax classifier on p gives a distribution over a set of labels: label = softmax(W(label) p).

[Figure: the network mapping children c1, c2 to the parent p, with the label on top.]
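As a concrete sketch of this unit (random weights stand in for trained parameters; the label layer uses the softmax classifier W(label) that appears later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_labels = 4, 3                            # vector dimension, number of labels

W = rng.standard_normal((n, 2 * n)) * 0.1     # composition matrix
b = np.zeros(n)
W_label = rng.standard_normal((n_labels, n)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compose(c1, c2):
    """One RNN step: merge two child vectors into a parent vector and label it."""
    p = sigmoid(W @ np.concatenate([c1, c2]) + b)   # p = sigmoid(W [c1; c2] + b)
    label_dist = softmax(W_label @ p)               # distribution over labels
    return p, label_dist

c1, c2 = rng.standard_normal(n), rng.standard_normal(n)
p, label_dist = compose(c1, c2)
print(p.shape, label_dist.sum())   # (4,) ~1.0
```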
SLIDE 12

Recursive Neural Network Definition

Related Work:

  • Previous RNN work (Goller & Küchler, 1996; Costa et al., 2003) assumed a fixed tree structure, used one-hot vectors, and had no softmax classifiers.
  • Jordan Pollack (1990): Recursive auto-associative memories (RAAMs).
  • Hinton (1990) and Bottou (2011): related ideas about recursive models.

SLIDE 13

Goal: Predict Pos/Neg Sentiment of Full Sentence

[Figure: the goal illustrated on "The movie was not really exciting.": compose word vectors up a tree and predict a single sentence-level sentiment score (here 0.3, i.e., negative).]

SLIDE 14

Predicting Sentiment with RNNs

[Figure: word vectors for "The movie was not really exciting." with a sentiment score at each word (0.5 for the neutral words, 0.3 for "not", 0.7 for "exciting").]

SLIDE 15

Predicting Sentiment with RNNs

[Figure: the network is applied to adjacent word pairs of "The movie was not really exciting."; each merge yields a parent vector and a sentiment score (e.g., 0.9 and 0.5).]

At every merge: p = sigmoid(W [c1; c2] + b)
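To attach per-node scores like those in the figure, one can put a single logistic read-out on each parent vector. This is only a sketch (w_sent is a hypothetical stand-in weight vector; real scores come from the trained classifier):

```python
import numpy as np

w_sent = np.random.default_rng(2).standard_normal(4) * 0.1  # stand-in read-out weights

def sentiment_score(p):
    """Probability that the phrase at this node is positive (single logistic unit)."""
    return 1.0 / (1.0 + np.exp(-(w_sent @ p)))

# e.g. sentiment_score(parent) after each merge gives the per-node scores shown.
```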

SLIDE 16

Predicting Sentiment with RNNs

[Figure: a higher-level merge; the network combines two node vectors into a parent with sentiment score 0.3.]

SLIDE 17

Predicting Sentiment with RNNs

[Figure: the partially built tree over "The movie was not really exciting.", with parent vectors at the merged nodes.]

SLIDE 18

Predicting Sentiment with RNNs

[Figure: the completed tree over "The movie was not really exciting."; the top node's sentiment score is 0.3 (below 0.5, i.e., negative).]

SLIDE 19

Outline

Goal: Algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks.

  1. Introduction
  2. Word Vectors and Recursive Neural Networks
  3. Recursive Autoencoders for Sentiment Analysis [Socher et al., EMNLP 2011]
  4. Paraphrase Detection

SLIDE 20

Sentiment Detection and Bag-of-Words Models

  • Sentiment detection is crucial to business intelligence, stock trading, …

SLIDE 21

Sentiment Detection and Bag-of-Words Models

  • Sentiment detection is crucial to business intelligence, stock trading, …
  • Most methods start with a bag of words + linguistic features/processing/lexica.
  • But such methods (including tf-idf) can't distinguish the following (see the sketch after this list):
      + white blood cells destroying an infection
      + an infection destroying white blood cells
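A minimal check of that claim, assuming a whitespace tokenizer: both phrases produce the same bag of words, so any bag-of-words model must score them identically.

```python
from collections import Counter

a = "white blood cells destroying an infection"
b = "an infection destroying white blood cells"

# Same multiset of words, opposite meanings: bag-of-words cannot tell them apart.
print(Counter(a.split()) == Counter(b.split()))  # True
```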
SLIDE 22

Single Scale Experiments: Movies

  • "Stealing Harvard doesn't care about cleverness, wit or any other kind of intelligent humor."
  • "A film of ideas and wry comic mayhem."

SLIDE 23

Recursive Autoencoders

  • Main Idea: A phrase vector is good if it keeps as much information as possible about its children.

[Figure: the neural network unit mapping children c1, c2 to a parent vector with a label.]

SLIDE 24

Recursive Autoencoders

  • Similar to the RNN, but with 2 differences. (1) A reconstruction error keeps as much information as possible:

p = sigmoid(W [c1; c2] + b)

[Figure: the autoencoder unit: encoder W(1) computes p from c1, c2; decoder W(2) produces the reconstruction error; a softmax classifier W(label) predicts the label.]

SLIDE 25

Recursive Autoencoders

  • Reconstruction error details: the parent p is decoded back into child reconstructions, [c1'; c2'] = W(2) p + b(2), and the reconstruction error is E_rec = || [c1; c2] - [c1'; c2'] ||^2.

[Figure: the unit with its reconstruction error (via W(1), W(2)) and softmax classifier (W(label)).]

SLIDE 26

Recursive Autoencoders

  • Reconstruction error at every node.
  • Important detail: normalization (each parent is rescaled to unit length, p ← p/||p||, so the reconstruction error cannot be lowered trivially by shrinking the hidden vectors).

p1 = f(W [x2; x3] + b)
p2 = f(W [x1; p1] + b)
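A sketch of one autoencoder step under this formulation (random weights stand in for trained parameters; f is tanh here, the decoder is kept linear for simplicity, and the parent is normalized as described above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

W1 = rng.standard_normal((n, 2 * n)) * 0.1   # encoder W(1)
b1 = np.zeros(n)
W2 = rng.standard_normal((2 * n, n)) * 0.1   # decoder W(2)
b2 = np.zeros(2 * n)

def rae_step(c1, c2):
    """Encode two children into a unit-length parent and score the reconstruction."""
    p = np.tanh(W1 @ np.concatenate([c1, c2]) + b1)
    p = p / np.linalg.norm(p)                 # the normalization detail above
    reconstruction = W2 @ p + b2              # [c1'; c2']
    e_rec = np.sum((np.concatenate([c1, c2]) - reconstruction) ** 2)
    return p, e_rec
```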

SLIDE 27

Recursive Autoencoders

  • Similar to the RNN, but with 2 differences. (2) The tree structure is determined by reconstruction error:
      – does not require a parser
      – gets task-dependent trees

[Figure: the network applied to every adjacent pair of "The movie was not really exciting.", with a reconstruction error for each candidate merge (e.g., 0.6, 2.3, 0.7, 3.1, 5.4); the pair with the lowest error is merged first.]
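A sketch of the greedy procedure the slides animate, using the hypothetical rae_step from the sketch above: repeatedly merge the adjacent pair with the lowest reconstruction error until one vector spans the sentence.

```python
def greedy_rae_tree(word_vectors):
    """Greedily merge the adjacent pair with the lowest reconstruction error."""
    nodes = list(word_vectors)
    while len(nodes) > 1:
        candidates = [rae_step(nodes[i], nodes[i + 1])
                      for i in range(len(nodes) - 1)]
        best = min(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, _ = candidates[best]
        nodes[best:best + 2] = [parent]   # replace the pair by its parent
    return nodes[0]                       # vector for the full sentence
```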

SLIDE 28

Recursive Autoencoders

[Figure: after the first merge, the network is reapplied to the remaining adjacent pairs of "The movie was not really exciting."; the candidate reconstruction errors are now e.g. 0.9, 3.1, 5.4, 0.7.]

SLIDE 29

Recursive Autoencoders

[Figure: the greedy merging continues up the tree; the remaining candidate reconstruction errors are e.g. 0.9, 3.1, 0.7.]

SLIDE 30

Recursive Autoencoders

[Figure: the finished RAE tree over "The movie was not really exciting.", with a vector at every node.]

SLIDE 31

RAE Training

  • Lower the error over the entire sentence x and its label t (+ regularization).
  • The error of a sentence is the sum of the errors at all nodes in its tree:

E(x, t) = Σ_{s ∈ tree(x)} E(s, t)
SLIDE 32

RAE Training

  • The error at each node is a weighted combination of the reconstruction error and the cross-entropy (distribution likelihood) from the softmax classifier.

[Figure: the unit annotated with its reconstruction error (via W(1), W(2)) and its cross-entropy error (via W(label)).]
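Spelled out (a reconstruction from the description above; α is the weighting hyperparameter, t the target label distribution, and d(p) = softmax(W(label) p)):

```latex
E(s) = \alpha \, E_{\mathrm{rec}}(s) + (1 - \alpha) \, E_{\mathrm{cE}}(s),
\qquad
E_{\mathrm{rec}}(s) = \left\| [c_1; c_2] - [c_1'; c_2'] \right\|^2,
\qquad
E_{\mathrm{cE}}(s) = -\sum_k t_k \log d_k(p)
```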

SLIDE 33

Details for Training RNNs

  • Minimize the error by taking gradient steps computed from matrix derivatives.
  • A more efficient implementation uses the backpropagation algorithm.
  • Since we compute derivatives over a tree structure, this is called backpropagation through structure (Goller & Küchler, 1996).
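A sketch of backpropagation through structure under simplifying assumptions (tanh units; the normalization and classifier terms are omitted). The node dicts are hypothetical: internal nodes carry a vector "p" and children "left"/"right", leaves carry "p" and "word", and grads holds zero-initialized "W" and "b" arrays.

```python
import numpy as np

def backprop_through_structure(node, dEdp, W, grads):
    """Send a parent's error gradient down the tree (Goller & Küchler, 1996)."""
    delta = dEdp * (1.0 - node["p"] ** 2)            # through tanh: p = tanh(W[c1;c2]+b)
    children = np.concatenate([node["left"]["p"], node["right"]["p"]])
    grads["W"] += np.outer(delta, children)          # dE/dW contribution at this node
    grads["b"] += delta
    d_children = W.T @ delta                         # dE/d[c1; c2]
    n = len(node["p"])
    pairs = ((node["left"], d_children[:n]), (node["right"], d_children[n:]))
    for child, d in pairs:
        if "word" in child:                          # leaf: gradient for the word vector
            grads[child["word"]] = grads.get(child["word"], 0) + d
        else:                                        # internal node: recurse downward
            backprop_through_structure(child, d, W, grads)
```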

SLIDE 34

Accuracy of Positive/Negative Sentiment Classification

  • Results on movie reviews (MR) and opinions (MPQA).
  • All other methods use hand-designed polarity-shifting rules or sentiment lexica.
  • RAE: no hand-designed features; learns vector representations for n-grams.

Method                               MR     MPQA
Phrase voting with lexicons          63.1   81.7
Bag of features with lexicons        76.4   84.1
Tree-CRF (Nakagawa et al., 2010)     77.3   86.1
RAE (this work)                      77.7   86.4

SLIDE 35

Sorted Negative and Positive N-grams

Most Negative N-grams:
  • bad; boring; dull; flat; pointless
  • that bad; abysmally pathetic
  • is more boring; manipulative and contrived
  • boring than anything else.; a major waste ... generic
  • loud, silly, stupid and pointless.; dull, dumb and derivative horror film.

Most Positive N-grams:
  • touching; enjoyable; powerful
  • the beautiful; with dazzling
  • funny and touching; a small gem
  • cute, funny, heartwarming; with wry humor and genuine
  • , deeply absorbing piece that works as a; ... one of the most ingenious and entertaining

SLIDE 36

Learning Compositionality from Movie Reviews

  • Probability of being positive for several n-grams:

n-gram           P(positive | n-gram)
good             0.45
not good         0.20
very good        0.61
not very good    0.15
not              0.03
very             0.23

SLIDE 37

Vector representations when training only for sentiment

  • For pdf, see http://www.socher.org/index.php/Main/Semi-SupervisedRecursiveAutoencodersForPredictingSentimentDistributions
SLIDE 38

Sentiment Distribution Experiments

  • Learn distributions over multiple complex sentiments → a new dataset and task.
  • Experience Project (http://www.experienceproject.com)
      – Example entry: "I walked into a parked car"
      – Reader reactions: Sorry, Hugs; You rock; Tee-hee; I understand; Wow just wow
      – Over 31,000 entries with 113 words on average

SLIDE 39

Sentiment distributions

  • Categories: Sorry, Hugs; You rock; Tee-hee; I understand; Wow just wow

[Figure: predicted and gold reaction distributions shown next to each anonymous confession.]

Anonymous confessions:
  • "i am a very succesfull business man. i make good money but i have been addicted to crack for 13 years. i moved 1 hour away from my dealers 10 years ago to stop using now i dont use daily but …"
  • "well i think hairy women are attractive"
  • "Dear Love, I just want to say that I am looking for you. Tonight I felt the urge to write, and I am becoming more and more frustrated that I have not found you yet. I'm also tired of spending so much heart on an old dream. ..."

SLIDE 40

Sentiment distributions

  • Categories: Sorry, Hugs; You rock; Tee-hee; I understand; Wow just wow

[Figure: predicted and gold reaction distributions shown next to each anonymous confession.]

Anonymous confessions:
  • "I loved her but I screwed it up. Now she's moved on. I'll never have her again. I don't know if I'll ever stop thinking about her."
  • "Could be kissing you right now. I should be wrapped in your arms in the dark, but instead I've ruined everything. I've piled bricks to make a wall where there never should have been one. I feel an ache that I shouldn't feel because…"
  • "My paper is due in less than 24 hours and I'm still dancing round my room!"

SLIDE 41

Experience Project most votes results

Method                                 Accuracy %
Random                                 20
Most frequent class                    38
Bag of words; MaxEnt classifier        46
Spellchecker, sentiment lexica, SVM    47
SVM on neural net word features        46
RAE (this work)                        50

SLIDE 42

Experience Project most votes results

Average KL divergence between gold and predicted label distributions:

[Figure: bar chart comparing the methods by average KL divergence (lower is better).]

SLIDE 43

Outline

Goal: Algorithms that recover and learn semantic vector representations based on recursive structure for multiple language tasks.

  1. Introduction
  2. Word Vectors and Recursive Neural Networks
  3. Recursive Autoencoders for Sentiment Analysis
  4. Paraphrase Detection [Socher et al., NIPS 2011]

SLIDE 44

Paraphrase Detection

  • "Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses."
  • "Basically, the plaintiffs did not show that omissions in Merrill's research caused the claimed losses."

  • "The initial report was made to Modesto Police December 28."
  • "It stems from a Modesto police report."
SLIDE 45

Recursive Autoencoders for Full Sentence Paraphrase Detection

How to compare the meaning of two sentences?

SLIDE 46

Unsupervised unfolding RAE
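The slide's figure is not reproduced here. In the unfolding RAE (per the NIPS 2011 paper this section describes), a node is scored by how well it can reconstruct all of the leaves underneath it, not just its two direct children. A sketch of that recursive decoding, under stated assumptions (`shape` is a hypothetical nested-tuple encoding of the subtree's branching; W2, b2 are decoder parameters as in the earlier autoencoder sketch):

```python
import numpy as np

def unfold(p, shape, W2, b2, n):
    """Decode a node vector back down to the leaf vectors of its subtree."""
    if shape == "leaf":
        return [p]
    left_shape, right_shape = shape
    c = np.tanh(W2 @ p + b2)                       # reconstruct the two children
    return (unfold(c[:n], left_shape, W2, b2, n) +
            unfold(c[n:], right_shape, W2, b2, n))

def unfolding_error(p, shape, leaves, W2, b2, n):
    """Unfolding reconstruction error against the original word vectors."""
    decoded = unfold(p, shape, W2, b2, n)
    return sum(np.sum((x - y) ** 2) for x, y in zip(decoded, leaves))
```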

SLIDE 47

Nearest Neighbors of the Unfolding RAE

Center Phrase                  RAE                                                           Unfolding RAE
the U.S.                       the Swiss                                                     the former U.S.
suffering low morale           suffering due to no fault of my own                           suffering heavy casualties
advance to the next round      advance to the final of the UNK 1.1 million Kremlin Cup       advance to the semis
a prominent political figure   the second high-profile opposition figure                     a powerful business figure
conditions of his release      conditions of peace, social stability and political harmony   negotiations for their release

  • More semantic vector representations.
SLIDE 48

How much can the vectors capture?

SLIDE 49

Recursive Autoencoders for Full Sentence Paraphrase Detection

  • Unsupervised RAE and a pair-wise sentence comparison of nodes in parsed trees.

[Figure: a pairwise similarity matrix comparing the vector at every node of sentence 1 with the vector at every node of sentence 2.]

SLIDE 50

Recursive Autoencoders for Full Sentence Paraphrase Detection

  • Pooling operation: min-pooling to find close matches, reducing the variable-size similarity matrix to a fixed-size grid for the classifier. A sketch follows below.
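A sketch of this pipeline: collect the vectors at every node of each sentence's tree, form the pairwise distance matrix, then min-pool it to a fixed-size grid. The output size (here 15) and the distance measure are illustrative choices, and the sketch assumes both sentences contribute at least `size` node vectors; shorter sentences would need special handling.

```python
import numpy as np

def similarity_matrix(nodes_a, nodes_b):
    """Pairwise Euclidean distances between all node vectors of two sentences."""
    A, B = np.stack(nodes_a), np.stack(nodes_b)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def min_pool(S, size=15):
    """Min-pool a variable-size distance matrix down to a fixed size x size grid."""
    rows = np.array_split(np.arange(S.shape[0]), size)
    cols = np.array_split(np.arange(S.shape[1]), size)
    # The minimum in each cell keeps the closest node-pair match in that region.
    return np.array([[S[np.ix_(r, c)].min() for c in cols] for r in rows])
```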
SLIDE 51

Recursive Autoencoders for Full Sentence Paraphrase Detection

  • Experiments on the Microsoft Research Paraphrase Corpus (Dolan et al., 2004)

Method                                          Acc.   F1
All Paraphrase Baseline                         66.5   79.9
Rus et al. (2008)                               70.6   80.5
Mihalcea et al. (2006)                          70.3   81.3
Islam et al. (2007)                             72.6   81.3
Qiu et al. (2006)                               72.0   81.6
Fernando et al. (2008)                          74.1   82.4
Wan et al. (2006)                               75.6   83.0
Das and Smith (2009)                            73.9   82.3
Das and Smith (2009) + 18 surface features      76.1   82.7
Unfolding Recursive Autoencoder (our method)    76.4   83.4

SLIDE 52

Recursive Autoencoders for Full Sentence Paraphrase Detection

SLIDE 53

Recursive Neural Networks for Compositional Vectors

  • Questions?
  • More information and code at www.socher.org

[Figure: the recurring RNN/RAE unit: p = sigmoid(W [c1; c2] + b); a softmax classifier W(label) predicts the label; W(1), W(2) give the reconstruction error.]