Convolutional Neural Network Architectures for Matching - PowerPoint PPT Presentation

SLIDE 1

Outline

  • Convolutional Neural Network Architectures for Matching Natural Language Sentences. NIPS’14
      • Convolutional Sentence Model
      • Convolutional Matching Models
      • Experiments
  • Deep Recursive Neural Networks for Compositionality in Language. NIPS’14
      • Deep Recursive Neural Networks
      • Experiments

LU Yangyang, luyy11@sei.pku.edu.cn
Jan. 14, 2015

SLIDE 3

Authors

  • Convolutional Neural Network Architectures for Matching Natural Language Sentences
  • NIPS’14
  • Baotian Hu¹, Zhengdong Lu², Hang Li², and Qingcai Chen¹

¹ Harbin Institute of Technology, Shenzhen Graduate School
² Noah’s Ark Lab, Huawei Technologies Co. Ltd.

SLIDES 4-6

Introduction

Matching two potentially heterogeneous language objects:

  • to model the correspondence between “linguistic objects” of different nature at different levels of abstraction
  • generalizes the conventional notion of similarity or relevance
  • related tasks: top-k re-ranking in machine translation, dialogue, paraphrase identification

Natural language sentences:

  • complicated structures: sequential & hierarchical

Sentence matching must capture:

  • the internal structures of sentences
  • the rich patterns in their interactions

→ adapting the convolutional strategy to natural language:

  • the hierarchical composition for sentences
  • the simple-to-comprehensive fusion of matching patterns
SLIDE 7

Convolutional Sentence Model

Convolution: Given sentence input x, the convolution unit for feature map of type-f (among F_l of them) on Layer-l is

    z_i^(l,f)(x) = σ(W^(l,f) ẑ_i^(l−1) + b^(l,f))

where

  • z_i^(l,f)(x): the output of feature map of type-f for location i in Layer-l
  • W^(l,f): the parameters for f on Layer-l
  • σ(·): the activation function (sigmoid or ReLU)
  • ẑ_i^(l−1): the segment of Layer-(l − 1) for the convolution at location i
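To make the sliding-window mechanics concrete, here is a minimal numpy sketch of one such convolution layer; the function and parameter names (and the choice of ReLU for σ) are our own illustration, not the paper’s code:

```python
import numpy as np

def conv_unit(z_prev, W, b, k=3):
    """One convolution layer of the sentence model (minimal sketch).

    z_prev: (length, d) outputs of Layer-(l-1), one row per location.
    W:      (F_l, k * d) filter parameters, one row per feature map f.
    b:      (F_l,) bias.
    Returns (length - k + 1, F_l): z_i^(l,f) for every location i.
    """
    length, d = z_prev.shape
    out = np.empty((length - k + 1, W.shape[0]))
    for i in range(length - k + 1):
        z_hat = z_prev[i:i + k].reshape(-1)      # ẑ_i^(l−1): the k-word window, concatenated
        out[i] = np.maximum(0.0, W @ z_hat + b)  # σ = ReLU
    return out
```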

SLIDE 8

Convolutional Sentence Model (cont.)

Max-Pooling: in every two-unit window, for every feature map f

  • shrinks the size of the representation by half → quickly absorbs the differences in length for sentence representation
  • filters out undesirable compositions of words

Length Variability:

  • put all-zero padding vectors after the last word of the sentence, up to the maximum length
  • to eliminate the boundary effect caused by the great variability of sentence lengths: add a gate g(v) to the convolutional unit, which sets the output vector to all zeros if the input is all zeros

A minimal sketch of the pooling and the gate follows.
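Extending the convolution sketch above, under the same assumptions about names and shapes:

```python
import numpy as np

def max_pool_2unit(z):
    """Max-pooling in non-overlapping two-unit windows: halves the length."""
    length = z.shape[0] - (z.shape[0] % 2)       # drop a trailing odd unit
    return z[:length].reshape(-1, 2, z.shape[1]).max(axis=1)

def gated_conv_unit(z_prev, W, b, k=3):
    """Convolution unit with the gate g(v): windows that are all zeros
    (padding past the end of the sentence) produce all-zero outputs."""
    length, d = z_prev.shape
    out = np.zeros((length - k + 1, W.shape[0]))
    for i in range(length - k + 1):
        window = z_prev[i:i + k]
        if np.any(window):                       # g(v) = 0 iff the input is all zeros
            out[i] = np.maximum(0.0, W @ window.reshape(-1) + b)
    return out
```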

SLIDES 9-10

Some Analysis on the Convolutional Architecture

The convolutional unit + max-pooling: a compositional operator with a local selection mechanism, as in the recursive autoencoder.

Compared to Recursive Models:

  • does not take a single path of word/phrase composition (determined by a separate gating function, an external parser, or just natural sequential order)
  • takes multiple choices of composition via a large feature map, and leaves it to the pooling afterwards to pick the more appropriate segments for each composition
  • limitation of the convolutional architecture: a fixed depth, bounding the level of composition it can do

Relation to “Shallow” Convolutional Models:

  • SENNA-type architecture: a convolution layer (local) and a max-pooling layer (global) → loses the sentence-level sequential order
  • this model is a superset of SENNA-type architectures
SLIDES 11-12

Architecture-I (ARC-I)¹

Convolutional Matching Models

  • finding the representation of each sentence
  • comparing the representations of the two sentences with a multi-layer perceptron (MLP)

The drawback of ARC-I:

  • it defers the interaction between the two sentences until their individual representations mature → it runs the risk of losing details important for the matching task while representing the sentences
  • the representation of each sentence is formed without knowledge of the other, and this cannot be adequately circumvented in the backward phase (learning)

¹ the Siamese architecture: A. Bordes et al. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259, 2014.
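Putting the pieces together, a rough sketch of the ARC-I scoring path, reusing gated_conv_unit and max_pool_2unit from the sketches above; all parameter names and shapes here are assumptions for illustration:

```python
import numpy as np

def arc_i_score(sent_x, sent_y, conv_params, mlp_W1, mlp_b1, mlp_w2, k=3):
    """ARC-I (sketch): encode each sentence separately with the convolutional
    sentence model, then score the concatenated pair with an MLP."""
    def encode(sent):
        z = sent                                  # (max_length, d), zero-padded
        for W, b in conv_params:                  # alternate convolution + pooling
            z = gated_conv_unit(z, W, b, k)
            z = max_pool_2unit(z)
        return z.reshape(-1)                      # fixed-size sentence vector
    pair = np.concatenate([encode(sent_x), encode(sent_y)])
    hidden = np.tanh(mlp_W1 @ pair + mlp_b1)
    return float(mlp_w2 @ hidden)                 # matching score s(x, y)
```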

SLIDE 13

Architecture-II (ARC-II)

Convolutional Matching Models

ARC-II: built directly on the interaction space between two sentences

  • letting the two sentences meet before their own high-level representations mature
  • still retaining the space for the individual development of abstraction of each sentence

Layer-1: “one-dimensional” (1D) convolutions; for segment i on SX and segment j on SY, the unit sees the concatenation of the two segments
Layer-2: a 2D max-pooling in non-overlapping 2 × 2 windows
Layer-3: a 2D convolution on k3 × k3 windows of the output from Layer-2

A minimal sketch of Layer-1 and Layer-2 follows.
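A minimal numpy sketch of ARC-II’s first two layers; names and shapes are our own assumptions:

```python
import numpy as np

def arc_ii_layer1(sent_x, sent_y, W, b, k=3):
    """ARC-II Layer-1 (sketch): a '1D' convolution on the interaction space.
    For each pair (i, j), the unit sees segment i of S_X concatenated with
    segment j of S_Y and emits one vector of feature-map activations."""
    nx, d = sent_x.shape
    ny, _ = sent_y.shape
    out = np.zeros((nx - k + 1, ny - k + 1, W.shape[0]))
    for i in range(nx - k + 1):
        for j in range(ny - k + 1):
            window = np.concatenate([sent_x[i:i + k].reshape(-1),
                                     sent_y[j:j + k].reshape(-1)])
            out[i, j] = np.maximum(0.0, W @ window + b)
    return out                                   # 2D map fed to 2 × 2 max-pooling

def max_pool_2x2(z):
    """ARC-II Layer-2 (sketch): 2D max-pooling in non-overlapping 2 × 2 windows."""
    nx, ny = z.shape[0] // 2 * 2, z.shape[1] // 2 * 2
    return z[:nx, :ny].reshape(nx // 2, 2, ny // 2, 2, -1).max(axis=(1, 3))
```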

SLIDE 14

Some Analysis on ARC-II

Convolutional Matching Models

Order Preservation:

  • Both the convolution and the pooling operation in ARC-II have this order-preserving property.
  • Generally, z_{i,j}^(l) contains information about words in SX before those in z_{i+1,j}^(l), although the two may be generated with slightly different segments of SY, due to the 2D pooling.

Model Generality:

  • ARC-II actually subsumes ARC-I as a special case
SLIDE 15

Training

Objective: negative sampling + a large-margin ranking loss (a sketch follows the list)

  • Stochastic gradient descent with mini-batches (100 ∼ 200 in size)
  • Regularization:
      • early stopping: enough for models with medium size and large training sets (with over 500K instances)
      • early stopping + dropout: for small datasets (less than 10K training instances)
  • Initialized input: 50-dimensional word embeddings trained with Word2Vec
      • English: learnt on Wikipedia (∼ 1B words)
      • Chinese: learnt on Weibo data (∼ 300M words)
  • Convolution:
      • 3-word window throughout all experiments
      • various numbers of feature maps tested (typically from 200 to 500)
  • Architecture:
      • ARC-II for all tasks: 8 layers (3 convolutions + 3 poolings + 2 MLPs)
      • ARC-I: fewer layers (2 convolutions + 2 poolings + 2 MLPs) and more hidden nodes
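A sketch of the large-margin objective over sampled triples (x, y⁺, y⁻); the margin value of 1.0 is our assumption for illustration:

```python
def margin_ranking_loss(score_pos, score_neg, margin=1.0):
    """Large-margin objective with negative sampling (sketch): for a triple
    (x, y+, y-), require s(x, y+) to exceed s(x, y-) by at least the margin."""
    return max(0.0, margin + score_neg - score_pos)

# accumulated over a mini-batch of sampled triples, e.g.:
# loss = sum(margin_ranking_loss(s(x, yp), s(x, yn)) for x, yp, yn in batch)
```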

SLIDE 16

Tasks & Competitor Methods

Three tasks:

  • Matching language objects of heterogeneous natures
      • I. Sentence Completion
      • II. Tweet-Response Matching
  • Matching homogeneous objects
      • III. Paraphrase Identification

Competitor Methods:

  • WordEmbed: represent each short text as the sum of the embeddings of the words it contains, and match two documents by an MLP
  • DeepMatch²: 3 hidden layers and 1,000 hidden nodes in the first hidden layer
  • uRAE+MLP³: unfolding RAE, each sentence represented as a 100-dimensional vector
  • SENNA+MLP/sim: the SENNA-type sentence model
  • SenMLP: take the whole sentence as input and use an MLP to obtain the score of coherence

² Z. Lu and H. Li. A deep architecture for matching short texts. In Advances in NIPS, 2013.
³ R. Socher, E. H. Huang, and A. Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in NIPS, 2011.

SLIDE 17

Experiment I: Sentence Completion

  • take a sentence from Reuters with two “balanced” clauses (with 8 ∼ 28 words) divided by one comma
  • use the first clause as SX and the second as SY
  • TASK: to recover the original second clause for any given first clause
  • to make the task harder: use negative second clauses similar to the original ones, both in training and testing
  • training: 3 million triples (from 600K positive pairs)
  • test: 50K positive pairs × 4 negatives

An example:
SX:  Although the state has only four votes in the Electoral College,
S+Y: its loss would be a symbolic blow to Republican presidential candidate Bob Dole.
S−Y: but it failed to garner enough votes to override an expected veto by President Clinton.

SLIDE 18

Experiment II: Matching A Response to A Tweet

  • training: 4.5 million original (tweet, response) pairs collected from Weibo
  • training: each positive pair × 10 random responses as negative examples → 45 million triples
  • the writing style is obviously more free and informal (than in Experiment I)
  • test: 300K original (tweet, response) pairs × 4 random negatives

An example from Weibo (translated to English):
SX:  Damn, I have to work overtime this weekend!
S+Y: Try to have some rest, buddy.
S−Y: It is hard to find a job, better start polishing your resume.

SLIDES 19-20

Experiment III: Paraphrase Identification

  • MSRP dataset: training/test – 4,076/1,725 pairs
  • the state of the art: Acc./F1 – 76.8%/83.6%

Discussions:

  • ARC-II outperforms the others significantly when training instances are relatively abundant
  • convolutional models (ARC-I & II, SENNA+MLP) perform favorably over bag-of-words models
  • a simple sum of embeddings learned via Word2Vec yields reasonably good results on all three tasks

SLIDE 21

Summary

A successful sentence-matching algorithm needs to capture not only the internal structures of sentences but also the rich patterns in their interactions.

Convolutional Sentence Model:

  • convolution layers (with a gate on convolutional units) + max-pooling layers

Convolutional Matching Models:

  • ARC-I: separate convolutional models for the two sentences + MLP
  • ARC-II: convolutional models on an interaction matrix + MLP

Three Tasks:

  • Matching language objects of heterogeneous natures
      • I. Sentence Completion: Reuters
      • II. Tweet-Response Matching: Weibo
  • Matching homogeneous objects
      • III. Paraphrase Identification: MSRP dataset
SLIDE 22

Outline

  • Convolutional Neural Network Architectures for Matching Natural Language Sentences. NIPS’14
  • Deep Recursive Neural Networks for Compositionality in Language. NIPS’14
      • Deep Recursive Neural Networks
      • Experiments

SLIDE 23

Authors

  • Deep Recursive Neural Networks for Compositionality in Language
  • NIPS’14
  • Ozan Irsoy and Claire Cardie

Department of Computer Science, Cornell University

SLIDES 24-25

Introduction

Recursive neural networks:

  • Given the structural representation of a sentence, e.g. a parse tree, they recursively generate parent representations in a bottom-up fashion, combining tokens to produce representations for phrases and eventually the whole sentence.
  • A recursive neural network can be seen as a generalization of the recurrent neural network, which has a specific type of skewed tree structure (depth in time vs. depth in space).
  • Deep recurrent networks are constructed by stacking multiple recurrent layers on top of each other.

→ deep recursive neural network: stacking multiple recursive layers

  • A layer can learn some parts of the composition to apply, and pass this intermediate representation to the next layer for further processing of the remaining parts of the overall composition.

SLIDE 26

Recursive Neural Networks

  • Given a positional directed acyclic graph, it visits the nodes in topological order, and recursively applies transformations to generate further representations from previously computed representations of children.
  • Given a binary tree structure with leaves holding the initial representations, each parent is computed as

        h_η = f(W_L h_{l(η)} + W_R h_{r(η)} + b)

    where l(η) and r(η) are the left and right children of node η (a minimal sketch follows).

The aforementioned definition treats the leaf nodes and the internal nodes the same
→ Untying Leaves and Internals: distinguish between a leaf and an internal node
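A minimal sketch of this bottom-up composition over a binary tree; the tree representation and names are our own assumptions:

```python
import numpy as np

def recursive_rep(node, W_L, W_R, b):
    """Bottom-up composition over a binary tree (sketch). A node is either a
    leaf word vector (np.ndarray) or a (left, right) pair of subtrees."""
    if isinstance(node, np.ndarray):               # leaf: its initial representation
        return node
    h_l = recursive_rep(node[0], W_L, W_R, b)
    h_r = recursive_rep(node[1], W_L, W_R, b)
    return np.tanh(W_L @ h_l + W_R @ h_r + b)      # h_η = f(W_L h_l + W_R h_r + b)
```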

SLIDE 27

Untying Leaves and Internals

  • A simple parametrization of the weights W with respect to whether the incoming edge emanates from a leaf or an internal node:

        h_η = x_η ∈ X if η is a leaf, and h_η ∈ H otherwise
        W_η = W^{xh} if η is a leaf, and W_η = W^{hh} otherwise

    where X and H are the vector spaces of words and phrases, respectively.

  • With this untying, a recursive network becomes a generalization of the Elman-type recurrent neural network, with h analogous to the hidden layer of the recurrent network (memory) and x analogous to the input layer. A sketch of the untied composition follows.
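A minimal sketch of the untied composition, with the same tree representation as the previous sketch (parameter names are our own assumptions):

```python
import numpy as np

def untied_recursive_rep(node, W_xh_L, W_xh_R, W_hh_L, W_hh_R, b):
    """Untied recursive composition (sketch): a child enters through W^{xh}
    when it is a leaf (a word vector in X) and through W^{hh} when it is an
    internal node (a phrase vector in H). Call this on an internal node."""
    def side(child, W_xh, W_hh):
        if isinstance(child, np.ndarray):          # leaf: word-space weights
            return W_xh @ child
        return W_hh @ untied_recursive_rep(child, W_xh_L, W_xh_R, W_hh_L, W_hh_R, b)
    left, right = node
    return np.tanh(side(left, W_xh_L, W_hh_L) + side(right, W_xh_R, W_hh_R) + b)
```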

SLIDE 28

Deep Recursive Neural Networks

  • Recursive neural networks are deep in structure: as deep as the depth of the tree → but this notion of depth is unlikely to involve a hierarchical interpretation of the data.
  • In the more conventional stacked deep learners, an important benefit of depth is the hierarchy among hidden representations: every hidden layer conceptually lies in a different representation space, and is potentially a more abstract representation of the input than the previous layer.

SLIDE 29

Deep Recursive Neural Networks (cont.)

  • Stacking multiple layers of individual recursive nets:

        h_η^(i) = f(W_L^(i) h_{l(η)}^(i) + W_R^(i) h_{r(η)}^(i) + V^(i) h_η^(i−1) + b^(i))

    i: the index of the stacked layer
    V^(i): the weight matrix connecting hidden layer-(i − 1) to layer-i

  • Connecting the output layer to only the final hidden layer:

        y_η = g(U h_η^(l) + c)

    l: the total number of layers

A minimal sketch of the stacked composition follows.
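A minimal sketch of the stacked recursion; how the leaves enter the deeper layers is our own simplifying assumption (the equation above only constrains internal nodes), and all names and dimensions are for illustration:

```python
import numpy as np

def deep_recursive_rep(node, layers):
    """Deep recursive network (sketch): layer i composes its children's
    layer-i vectors and also receives this node's layer-(i-1) vector
    through V^(i). `layers` is a list of dicts with keys W_L, W_R, V, b."""
    if isinstance(node, np.ndarray):
        # Simplifying assumption: a leaf contributes its word vector at every layer.
        return [node for _ in layers]
    h_left = deep_recursive_rep(node[0], layers)
    h_right = deep_recursive_rep(node[1], layers)
    hs, below = [], None
    for i, p in enumerate(layers):
        pre = p["W_L"] @ h_left[i] + p["W_R"] @ h_right[i] + p["b"]
        if below is not None:
            pre = pre + p["V"] @ below             # V^(i) h_η^(i−1)
        below = np.tanh(pre)
        hs.append(below)
    return hs                                      # an output layer would read hs[-1]
```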

SLIDE 30

Experiments

  • Dataset: Stanford Sentiment Treebank (SST)
      • sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences
      • average sentence length: 19.1 tokens
      • classification: 5-class (−−/−/0/+/++), 2-class (pos/neg)
  • Competitor Methods:
      • BiNB: a naive Bayes classifier that operates on bigram counts
      • shallow RNN: learns vectors via a binary parse tree
      • MV-RNN: every word is assigned a matrix-vector pair instead of a vector
      • RNTN: the recursive neural tensor network, in which the composition is defined as a bilinear tensor product
      • Paragraph Vectors: learns representations for larger pieces of text using models similar to Word2Vec
      • DCNN: the dynamic convolutional neural network

SLIDE 31

Experiments (cont.)

  • Word Vectors:
      • fixed in all the experiments, without fine-tuning
      • 300-dimensional word vectors trained by Word2Vec on part of the Google News dataset (∼ 100B words)
  • Regularizer: dropout
  • Training:
      • stochastic gradient descent with a fixed learning rate (.01)
      • a diagonal variant of AdaGrad for parameter updates
      • weights updated after minibatches of 20 sentences
      • 200 epochs for training
      • keep the overall number of parameters constant: deeper → narrower
      • no pre-training step
SLIDE 32

Experiment Results

Quantitative Evaluation

  • Results on RNNs of various depths and sizes show that deep RNNs outperform single-layer RNNs with approximately the same number of parameters.
  • The 2-layer RNN for the smaller networks and the 4-layer RNN for the larger networks give the best performance with respect to the fine-grained score.
  • The best deep RNN outperforms previous work on both the fine-grained and binary prediction tasks, and outperforms Paragraph Vectors on the fine-grained score.

SLIDE 33

Experiment Results

Input Perturbation

Investigate the response of all layers to a perturbation in the input:

  • pick a word from the sentence that carries positive sentiment
  • alter it to a set of words whose sentiment values shift towards the negative direction
  • all other leaves keep the same representations: a node is completely determined by its subtree
  • for each node, the response is measured as the change of its hidden representation in one-norm, at each of the three layers in the network, with respect to the hidden representations computed with the original word

A minimal sketch of this probe follows.
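A minimal sketch of the probe, reusing deep_recursive_rep from the earlier sketch; unlike the paper's per-node measurement, this version reads only the root, as a simplification:

```python
import numpy as np

def perturbation_response(tree_original, tree_perturbed, layers):
    """One-norm change of the root's hidden representation at every layer
    after swapping a sentiment-carrying word (simplified to the root node)."""
    h_orig = deep_recursive_rep(tree_original, layers)
    h_pert = deep_recursive_rep(tree_perturbed, layers)
    return [float(np.abs(a - b).sum()) for a, b in zip(h_orig, h_pert)]
```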

SLIDE 34

Experiment Results

Nearest Neighbor Phrases

  • for a three-layer deep recursive neural network: compute hidden representations for all phrases in the data
  • for a given phrase: find its nearest neighbor phrases at each layer, with the one-norm distance measure
  • at the 1st layer: similarity is dominated by one of the composed words
      • this effect is so strong that it even discards the negation in the second case
  • at the 2nd layer: a semantically more diverse set of phrases
      • this layer seems to take syntactic similarity more into account
  • at the 3rd and final layer: a higher level of semantic similarity; phrases are mostly related to one another in terms of sentiment

SLIDE 35

Summary

Deep recursive neural network:

  • stacking multiple recursive layers on top of each other
  • using binary parse trees as the structure
  • task: fine-grained sentiment classification
  • competitor methods: shallow RNN, MV-RNN, Tensor RNN (RNTN), DCNN, Paragraph Vectors
  • investigating the models qualitatively by performing input perturbation
  • examining nearest-neighbor phrases of given examples