A Convolutional Neural Network for Modelling Sentences - PowerPoint PPT Presentation

SLIDE 1

A Convolutional Neural Network for Modelling Sentences

Nal Kalchbrenner Edward Grefenstette Phil Blunsom

Department of Computer Science, Oxford University

SLIDE 2

Overview of Model

Represent sentences by extracting increasingly abstract features.

Input: a sequence of word embeddings
Output: classification probabilities

Each layer involves:

  • 1. Convolution
  • 2. Dynamic k-max pooling
  • 3. A non-linearity (tanh)
SLIDE 3

One-Dimensional Convolution

  • 1. The filter m ∈ R^m
  • 2. The sequence s ∈ R^s

Returns the sequence c ∈ R^{s−m+1}:

c_j = m^T s_{j−m+1:j},  j = 1, ..., s − m + 1

i.e. a dot product between each length-m subsequence of s and the filter m. A wide convolution instead zero-pads s with m − 1 zeros on each side, so every filter position that overlaps s is kept, giving s + m − 1 output values.
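The narrow and wide variants can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; `narrow_conv1d` and `wide_conv1d` are hypothetical names:

```python
import numpy as np

def narrow_conv1d(m, s):
    """Narrow 1-D convolution: dot product of the filter m with each
    length-m subsequence of s; output has length len(s) - len(m) + 1."""
    fw = len(m)
    return np.array([m @ s[j:j + fw] for j in range(len(s) - fw + 1)])

def wide_conv1d(m, s):
    """Wide convolution: zero-pad s with len(m) - 1 zeros on each side,
    so every filter position that overlaps s contributes an output."""
    fw = len(m)
    padded = np.concatenate([np.zeros(fw - 1), s, np.zeros(fw - 1)])
    return narrow_conv1d(m, padded)

m = np.array([1.0, 2.0])
s = np.array([1.0, 0.0, -1.0])
print(narrow_conv1d(m, s))  # 2 values: s - m + 1 = 2
print(wide_conv1d(m, s))    # 4 values: s + m - 1 = 4
```

Note this follows the slide's formula literally (no filter reversal, i.e. cross-correlation), which is the usual convention in neural networks.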

SLIDE 4

Convolution with Word Embeddings

Assume word embeddings of dimension d. The filter m is then in R^{d×m} and the sentence matrix s in R^{d×s}. Each row of m is convolved with the corresponding row of s.
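A minimal NumPy sketch of this row-wise wide convolution, one filter row per embedding dimension; `rowwise_wide_conv` is a hypothetical name, not from the paper:

```python
import numpy as np

def rowwise_wide_conv(M, S):
    """Convolve each row of the d x m filter M with the corresponding
    row of the d x s sentence matrix S (wide convolution per row)."""
    d, fw = M.shape
    _, slen = S.shape
    out = np.zeros((d, slen + fw - 1))
    for i in range(d):
        # pad this row with fw - 1 zeros on each side, then slide the filter
        pad = np.concatenate([np.zeros(fw - 1), S[i], np.zeros(fw - 1)])
        out[i] = [M[i] @ pad[j:j + fw] for j in range(slen + fw - 1)]
    return out

S = np.ones((2, 3))   # d = 2 embedding dimensions, sentence length 3
M = np.ones((2, 2))   # filter width 2
print(rowwise_wide_conv(M, S).shape)  # (2, 4): one output row per dimension
```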

SLIDE 5

k-Max Pooling (LeCun et al.)

Given k and a sequence p ∈ R^p, with p ≥ k:

  • 1. Return the k largest elements of p
  • 2. Keep the elements in their original order

Denoted p^k_max ∈ R^k
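The two steps above (select the k largest values, keep their original order) can be sketched in NumPy; `k_max_pool` is a hypothetical name:

```python
import numpy as np

def k_max_pool(p, k):
    """Return the k largest values of p, preserved in their original order."""
    # indices of the k largest entries, then sort the indices to keep order
    idx = np.sort(np.argpartition(p, -k)[-k:])
    return p[idx]

p = np.array([3.0, 1.0, 5.0, 2.0, 4.0])
print(k_max_pool(p, 3))  # [3. 5. 4.]
```

Sorting the selected indices, rather than the values, is what preserves the relative order of the features in the sequence.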

SLIDE 6

Dynamic k-Max Pooling

“Smooth extraction of higher-order features”

k_l = max( k_top, ⌈ (L − l) / L · s ⌉ )

  • k_top is a fixed parameter
  • l is the current layer
  • L is the total number of layers
  • s is the sentence length
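The schedule above can be sketched directly; `dynamic_k` is a hypothetical name, and the multiplication is done before the division so the intermediate stays exact for integer inputs:

```python
from math import ceil

def dynamic_k(l, L, s, k_top):
    """k for the pooling layer at depth l of L, sentence length s:
    interpolates from roughly s down to the fixed top-level k_top."""
    return max(k_top, ceil((L - l) * s / L))

# e.g. L = 3 layers, sentence of length 18, k_top = 3
print([dynamic_k(l, 3, 18, 3) for l in (1, 2, 3)])  # [12, 6, 3]
```

So earlier layers keep proportionally more features of a long sentence, while the top layer always pools down to exactly k_top values regardless of sentence length.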

SLIDE 7

Folding

Elementwise sum of pairs of rows of a matrix:

f : R^{d×n} → R^{d/2×n}

f(M) = N, where N[i, j] = M[2i, j] + M[2i+1, j] for i = 0, ..., d/2 − 1 and j = 0, ..., n − 1

  • Introduces dependencies between different feature rows
  • No added parameters
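Folding is a one-liner with NumPy strided slicing; a minimal sketch, with `fold` as a hypothetical name:

```python
import numpy as np

def fold(M):
    """Sum consecutive pairs of rows: shape (d, n) -> (d // 2, n).
    Adds no parameters but lets features in paired rows interact."""
    d, _ = M.shape
    assert d % 2 == 0, "folding assumes an even number of rows"
    return M[0::2] + M[1::2]  # even-indexed rows + odd-indexed rows

M = np.arange(8.0).reshape(4, 2)
print(fold(M))  # row0 + row1 and row2 + row3: [[2. 4.] [10. 12.]]
```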

SLIDE 8
SLIDE 9

Size of Network

Model         First layer       Second layer      k-top
              Width  Filters    Width  Filters
Binary          7       6         5      14         4
Multi-class    10       6         7      12         5

SLIDE 10

Training

The top layer is a softmax nonlinearity that predicts a probability distribution over classes. The objective function includes L2 regularization of the parameters. The parameters are the word embeddings, the filter weights, and the fully connected layers. Training uses Adagrad with mini-batches: “Processes multiple millions of sentences per hour on one GPU”.
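The slide names Adagrad as the optimizer; as a hedged sketch of the per-parameter update rule (illustrative learning rate and target, not the paper's hyperparameters):

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.1, eps=1e-6):
    """One Adagrad update: scale the learning rate per parameter by the
    root of the accumulated squared gradients for that parameter."""
    hist += grad ** 2
    w -= lr * grad / (np.sqrt(hist) + eps)
    return w, hist

target = np.array([1.0, -1.0, 0.5])
w = np.zeros(3)
hist = np.zeros(3)
for _ in range(5):
    grad = 2 * (w - target)  # gradient of the quadratic ||w - target||^2
    w, hist = adagrad_step(w, grad, hist)
```

Parameters with a history of large gradients get a smaller effective step, which suits the mix of dense filter weights and sparse word-embedding updates in this model.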
SLIDE 11

Experiments

  • 1. Predicting sentiment of movie reviews - binary (Socher et al. 2013)
  • 2. Predicting sentiment of movie reviews - multi-class (Socher et al. 2013)
  • 3. Categorization of questions (Li and Roth 2002)
  • 4. Sentiment of tweets, with labels based on emoticons (Go et al. 2009)

Feature embedding dimensionality is chosen based on the size of the dataset.

SLIDE 12

Movies accuracy

SLIDE 13

First layer feature-detectors

SLIDE 14

TREC 6-way classification accuracy

SLIDE 15

Twitter sentiment

SLIDE 16

Conclusion

Dynamic Convolutional Neural Networks

  • Convolutions apply a function to n-grams
  • Dynamic k-max pooling extracts the most active features, choosing k based on the layer and the sentence length
  • Composing these two operations can be seen as feature detection
  • Outperformed or stayed competitive with other neural approaches, baseline models, and state-of-the-art approaches, without needing handcrafted features