SLIDE 1

Distributed Representation of Sentences

LU Yangyang

luyy11@sei.pku.edu.cn

July 16, 2014 @ KERE Seminar

SLIDE 2

Outline

  • Distributed Representation of Sentences and Documents. ICML’14
    • Word Vector
    • Paragraph Vector
    • Experiments on NLP Tasks
  • A Convolutional Neural Network for Modelling Sentences. ACL’14
    • DCNN: Dynamic Convolutional Neural Networks
    • Experiments on NLP Tasks
  • Multilingual Models for Compositional Distributed Semantics. ACL’14
    • Composition Models
    • Experiments
  • Summary

SLIDE 3

Authors

  • Distributed Representation of Sentences and Documents. ICML’14 1
    • Quoc Le, Tomas Mikolov
    • Google Inc, Mountain View
  • A Convolutional Neural Network for Modelling Sentences. ACL’14 2
    • Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom
    • University of Oxford
  • Multilingual Models for Compositional Distributed Semantics. ACL’14
    • Karl Moritz Hermann, Phil Blunsom
    • University of Oxford

1 http://icml.cc/2014/index/article/15.htm
2 http://acl2014.org/acl2014/index.html


SLIDE 7

Recall: Word Vector 3

Every word:

  • A unique vector, represented by a column in a matrix W

Given a sequence of training words w1, w2, w3, ..., wT :

  • Predicting a word given the other words in a context (CBOW)
  • Predicting the surrounding words given a word (Skip-gram)

3Mikolov T, et al. Efficient estimation of word representations in vector space[C]. ICLR workshop, 2013

SLIDE 8

Recall: Word Vector

The Skip-gram Model 4

  • Predicting the surrounding words given a word in a sentence
  • The objective (see the sketch below):

maximize (1/T) ∑_{t=1}^{T} ∑_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

where c : the size of the training context

4Mikolov T, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013
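
To make the objective concrete, here is a minimal Python sketch of the inner term at one position t; `log_p` is a stand-in for the model's softmax-based log-probability, and the function name is illustrative, not code from the paper.

```python
# Minimal sketch of the skip-gram objective term at position t:
# sum of log p(w_{t+j} | w_t) over offsets -c..c with j != 0,
# clipped at the sentence boundaries. `log_p` is a placeholder
# for the model's log-probability.
def skipgram_term(words, t, c, log_p):
    return sum(log_p(words[t + j], words[t])
               for j in range(-c, c + 1)
               if j != 0 and 0 <= t + j < len(words))
```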

SLIDE 9

Recall: Word Vector

Continuous Bag-of-Words Model (CBOW) 5

  • Predicting a word given the other words in a context
  • The projection layer: shared for all words (not just the projection matrix)
  • The objective:

maximize (1/T) ∑_{t=k}^{T−k} log p(w_t | w_{t−k}, ..., w_{t+k})

5Mikolov T, et al. Efficient estimation of word representations in vector space[C]. ICLR workshop, 2013

SLIDE 10

Word Vector

  • The objective:

maximize (1/T) ∑_{t=k}^{T−k} log p(w_t | w_{t−k}, ..., w_{t+k})

  • The prediction task: via a multi-class classifier, e.g. softmax 6 (see the sketch below):

p(w_t | w_{t−k}, ..., w_{t+k}) = e^{y_{w_t}} / ∑_i e^{y_i}

y = b + U h(w_{t−k}, ..., w_{t+k}; W)

where U, b : the softmax parameters
      h : a concatenation or average of word vectors extracted from W

6 See slide 53.
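
A minimal numpy sketch of this prediction step; the dimensions, the choice of averaging for h, and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Sketch of y = b + U h(w_{t-k}, ..., w_{t+k}; W) followed by a softmax.
V, d = 10000, 100                   # vocabulary size, word vector size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (d, V))      # word vectors: one column per word
U = rng.normal(0, 0.1, (V, d))      # softmax weights
b = np.zeros(V)                     # softmax bias

def predict(context_ids):
    h = W[:, context_ids].mean(axis=1)   # h: average of the context word vectors
    y = b + U @ h
    e = np.exp(y - y.max())              # numerically stable softmax
    return e / e.sum()                   # p(w_t | w_{t-k}, ..., w_{t+k})

p = predict([3, 17, 42, 8])
```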

SLIDE 12

Paragraph Vector

PV-DM: A Distributed Memory Model

  • The paragraph vectors are asked to contribute to the prediction task of the next word, given many contexts sampled from the paragraph.
  • The paragraph vector acts as a memory that remembers what is missing from the current context, i.e. the topic of the paragraph.

SLIDE 14

PV-DM

  • Every paragraph: a column in matrix D
    • Shared across all contexts generated from the same paragraph, but not across paragraphs
  • Every word: a column in matrix W
    • Shared across paragraphs
  • Contexts: fixed-length, sampled from a sliding window over the paragraph
  • The paragraph vector and the context word vectors are concatenated (see the sketch below)

The only change compared to the word vector model:

y = b + U h(w_{t−k}, ..., w_{t+k}, d; W, D)

where h : constructed from W and D
      d : the vector of the paragraph from which the context is sampled
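
A sketch of this PV-DM prediction step in numpy; all sizes, the averaging of the word vectors before concatenation, and the names are illustrative assumptions.

```python
import numpy as np

# Sketch of the PV-DM prediction: h is built from both the paragraph
# vector (a column of D) and the context word vectors (columns of W).
V, dim, N = 10000, 100, 500            # vocab size, vector size, paragraph count
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (dim, V))       # word vectors, shared across paragraphs
D = rng.normal(0, 0.1, (dim, N))       # paragraph vectors
U = rng.normal(0, 0.1, (V, 2 * dim))   # softmax weights over the concatenated h
b = np.zeros(V)

def predict_pvdm(context_ids, para_id):
    h = np.concatenate([D[:, para_id], W[:, context_ids].mean(axis=1)])
    y = b + U @ h
    e = np.exp(y - y.max())
    return e / e.sum()                 # p(w_t | context words, paragraph)

p = predict_pvdm([3, 17, 42, 8], para_id=7)
```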

SLIDE 15

Paragraph Vector without word ordering

PV-DBOW: Distributed Bag-of-Words 7

  • Ignore the context words in the input
  • Force the model to predict words randomly sampled from the paragraph in the output
  • At each step (see the sketch below):
    • Sample a text window
    • Sample a random word from the text window
    • Form a classification task given the Paragraph Vector

7 Skip-gram model: see slide 7.
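
A minimal sketch of how one PV-DBOW training target could be drawn; the window size and names are illustrative assumptions.

```python
import random

# Sketch of drawing one PV-DBOW training example: sample a text window,
# then sample a target word from it; the classifier sees only the
# paragraph vector as input.
def pvdbow_target(paragraph_tokens, window=8):
    start = random.randrange(max(1, len(paragraph_tokens) - window + 1))
    text_window = paragraph_tokens[start:start + window]
    return random.choice(text_window)   # word to predict from the paragraph vector

target = pvdbow_target("the cat sat on the red mat".split())
```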

SLIDE 18

Sentiment Analysis

Stanford Sentiment Treebank Dataset 8

Dataset:

  • 11855 sentences taken from the movie review site Rotten Tomatoes
  • train/test/development: 8544/2210/1101 sentences
  • sentence/sub-phrase labels: 5-way fine-grained (++ / + / 0 / − / −−) or binary coarse-grained (pos/neg)
  • here only the labeling of full sentences is considered
  • each sentence is treated as a paragraph

Experiment protocols:

  • Paragraph Vector: a concatenation of PV-DM and PV-DBOW
  • PV-DM: 400 dimensions; PV-DBOW: 400 dimensions
  • The optimal window size: 8
  • Predictor of the movie rating: logistic regression

8Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013

SLIDE 20

Sentiment Analysis

IMDB Dataset 9

Dataset:

  • 100,000 movie reviews taken from IMDB
  • each movie review: several sentences
  • labeled train / unlabeled train / labeled test: 25,000 / 50,000 / 25,000
  • labels: binary (pos/neg)

Experimental protocols:

  • PV-DM: 400 dimensions; PV-DBOW: 400 dimensions
  • Learning word vectors and paragraph vectors: 25,000 labeled + 50,000 unlabeled reviews
  • The predictor: a neural network with one hidden layer of 50 units and a logistic classifier
  • The optimal window size: 10

9Maas, et al. Learning word vectors for sentiment analysis. ACL, 2011

SLIDE 21

Sentiment Analysis (cont.)

SLIDE 22

Information Retrieval with Paragraph Vector

Dataset:

  • 1,000,000 most popular queries × top 10 results, from a search engine
  • Constructing triplets of paragraphs:
    • 1st, 2nd: results of the same query
    • 3rd: randomly sampled from the rest of the collection (a different query)
  • Task: identify which two members of the triplet are results of the same query
SLIDE 25

Recall: Max-TDNN Sentence Model 10

  • TDNNs: Time-Delay Neural Networks
    • Model long-distance dependencies
    • "time" refers to the idea that a sequence has a notion of order
  • A TDNN "reads" the sequence in an online fashion: at time t ≥ 1, one sees x_t, the t-th word in the sentence
  • A classical TDNN layer:
    • A convolution on a given sequence x(·)
    • Outputting another sequence o(·)

10 Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML, 2008
SLIDE 26

DCNN: Overview

Convolutional Neural Networks with Dynamic k-Max Pooling

SLIDE 27

Wide Convolution

  • Each word w_i ∈ R^d
  • Sentence matrix s ∈ R^{d×s}
  • Weight matrix of the convolution filter m ∈ R^{d×m}
  • Matrix after (wide) convolution c ∈ R^{d×(s+m−1)} (see the sketch below)
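
A numpy sketch of the wide, row-wise convolution; `np.convolve` with mode="full" gives exactly the s + m − 1 output columns. Sizes are illustrative.

```python
import numpy as np

# Sketch of the wide convolution: each of the d rows of the sentence matrix s
# is convolved with the matching row of the filter m.
d, s_len, m_len = 4, 7, 3
rng = np.random.default_rng(0)
s = rng.normal(size=(d, s_len))    # sentence matrix: one word vector per column
m = rng.normal(size=(d, m_len))    # filter weights

c = np.stack([np.convolve(s[i], m[i], mode="full") for i in range(d)])
assert c.shape == (d, s_len + m_len - 1)
```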
SLIDE 30

(Dynamic) k-Max Pooling

k-Max Pooling:

  • A generalisation of max pooling over the time dimension 11
  • Different from local max pooling operations 12
  • Given a value k and a sequence p ∈ R^p (p ≥ k), k-max pooling selects the subsequence p_max^k of the k highest values of p
  • The order of the values in p_max^k corresponds to their original order in p

Dynamic k-Max Pooling (see the sketch below):

k_l = max(k_top, ⌈((L − l) / L) · s⌉)

where l : the number of the current convolutional layer to which the pooling is applied
      L : the total number of convolutional layers in the network
      k_top : the fixed pooling parameter for the topmost convolutional layer
      s : the length of the input sentence

11 Max-TDNN: Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML, 2008
12 A convolution network for object recognition: Yann LeCun, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998
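
Both operations fit in a few lines of numpy; a sketch with illustrative inputs follows.

```python
import numpy as np

# Sketch of k-max pooling (keep the k largest values of a row, preserving
# their original order) and of the dynamic k used by the DCNN.
def kmax_pool(row, k):
    idx = np.sort(np.argsort(row)[-k:])   # positions of the k largest values
    return row[idx]

def dynamic_k(l, L, ktop, s):
    # k_l = max(ktop, ceil((L - l) / L * s)), s = input sentence length
    return max(ktop, int(np.ceil((L - l) / L * s)))

row = np.array([0.2, 1.5, -0.3, 0.9, 2.1, 0.1])
print(kmax_pool(row, 3))                  # [1.5 0.9 2.1], in original order
print(dynamic_k(l=1, L=3, ktop=3, s=18))  # 12
```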

SLIDE 31

Non-linear Feature Function

  • After the wide convolution, a bias and a non-linear function are applied component-wise 13. Each d-dimensional column a in the resulting matrix can be written in terms of the matrix

M = [diag(m_{:,1}), ..., diag(m_{:,m})]

where m : the weights of the d filters of the wide convolution

  • Input sentence matrix → wide convolution + (dynamic) k-max pooling layer + non-linear function → a first-order feature map

13 Temporarily ignoring the pooling layer

SLIDE 32

Multiple Feature Maps

  • Repeating wide convolution + (dynamic) k-max pooling + non-linear function yields feature maps of increasing order 14 (see the sketch below):

F_j^i = ∑_{k=1}^{n} m_{j,k}^i * F_k^{i−1}

where F_j^i : the j-th feature map of the i-th order
      * : wide convolution
      m_{j,k}^i : a convolving matrix (all the m_{j,k}^i together form an order-4 tensor)

14LeCun Y, Bengio Y. Convolutional networks for images, speech, and time series[J]. The handbook of brain theory and neural networks, 1995
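
A numpy sketch of forming one higher-order feature map as a sum of wide convolutions over all lower-order maps; the shapes and the shared filter width are illustrative assumptions.

```python
import numpy as np

# Sketch of F_j^i = sum_k m_{j,k}^i * F_k^{i-1}, with * the row-wise
# wide convolution from the previous slide.
def next_order_map(prev_maps, filters_j):
    # prev_maps: list of n arrays of shape (d, s)
    # filters_j: list of n filter matrices of shape (d, m), one per lower map
    acc = None
    for F, m in zip(prev_maps, filters_j):
        conv = np.stack([np.convolve(F[i], m[i], mode="full")
                         for i in range(F.shape[0])])
        acc = conv if acc is None else acc + conv
    return acc   # shape (d, s + m - 1)
```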

SLIDE 33

Folding

In the formulation of the network so far:

  • Feature detectors are applied to an individual row
  • This creates complex dependencies across the same row in multiple feature maps
  • Feature detectors in different rows, however, are independent of each other until the top fully connected layer

Folding (sketched below):

  • For a map of d rows, folding returns a map of d/2 rows
  • Halves the size of the representation
  • With a folding layer, a feature detector of the i-th order now depends on two rows of feature values in the lower maps of order i − 1
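
Folding is simple enough to state in a few lines; a numpy sketch with illustrative sizes:

```python
import numpy as np

# Sketch of folding: component-wise sum of every pair of adjacent rows,
# turning a feature map of d rows into one of d/2 rows (d assumed even).
def fold(feature_map):
    d = feature_map.shape[0]
    return feature_map[0:d:2] + feature_map[1:d:2]

fm = np.arange(12, dtype=float).reshape(4, 3)
print(fold(fm).shape)   # (2, 3): rows 0+1 and 2+3 are summed
```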

SLIDE 35

Training

  • The top layer of the network: a fully connected layer followed by a softmax non-linearity
  • The softmax layer predicts the probability distribution over classes given the input sentence
  • The objective (see the sketch below):
    • Minimise the cross-entropy between the predicted and true distributions
    • Including an L2 regularisation term over the parameters
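
A minimal sketch of this objective for a single example; the regularisation coefficient and names are illustrative assumptions.

```python
import numpy as np

# Sketch of the training objective for one example: cross-entropy of the
# predicted class distribution against the true class, plus an L2 penalty
# over all parameter arrays.
def loss(pred_probs, true_class, params, l2=1e-4):
    cross_entropy = -np.log(pred_probs[true_class])
    l2_term = (l2 / 2) * sum(np.sum(p ** 2) for p in params)
    return cross_entropy + l2_term
```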
SLIDE 36

Sentiment Prediction in Movie Reviews

Stanford Sentiment Treebank Dataset

SLIDE 37

Question Type Classification

TREC Dataset

  • Six different question types
  • Train/test: 5452/500
  • DCNN: word dimension d = 32, a single convolutional layer with filters of size 8 and 5 feature maps

SLIDE 38

Twitter Sentiment Prediction with Distant Supervision

  • A tweet is automatically labelled as positive or negative depending on the emoticon that occurs in it
  • Train/test: 1.6 million (emoticon-based labels) / 400 (hand-annotated labels)
  • Preprocessing: a vocabulary of 76643 word types
  • DCNN: word dimension d = 60; other parameters the same as in the binary sentiment prediction task on the Stanford Sentiment Treebank

SLIDE 39

Visualising Feature Detectors

  • A filter in the DCNN is associated with a feature detector (neuron) that learns, during training, to be particularly active when presented with a specific sequence of input words
  • The first layer: continuous n-grams
  • Higher layers: multiple separate n-grams
SLIDE 41

Multilingual Models for Compositional Distributed Semantics

  • Representing meaning across languages in a shared multilingual semantic space
  • Proposing a novel unsupervised technique that:
    • leverages parallel corpora
    • employs semantic transfer through compositional representations
  • Experiments on two corpora:
    • cross-lingual document classification on the Reuters RCV1/RCV2 corpora
    • classification on a massively multilingual corpus derived from the TED corpus

SLIDE 43

Overview

  • Word representation: a continuous vector in R^d
  • Semantic representations of sentences and documents: computed by a compositional vector model (CVM)
  • A multilingual objective function: uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings

Example parallel sentence pairs:
(a) The cat sat on the red mat. (b) 猫坐在红色的垫子上。
(a) The cat sat on the red mat. (b) Die Katze saß auf der roten Matte.

SLIDE 45

Approach

  • Given enough parallel data, a shared representation of two parallel sentences would be forced to capture the common elements between these two sentences.
  • What parallel sentences share, of course, are their semantics.

Define a bilingual energy:

E_bi(a, b) = ‖f(a) − g(b)‖²

where C : a parallel corpus
      x, y : two different languages
      (a, b) ∈ C : a pair of parallel sentences in languages x, y
      f : X → R^d, g : Y → R^d

SLIDE 47

Approach (cont.)

  • The objective: minimise E_bi for all semantically equivalent sentences in the corpus, using a noise-contrastive hinge loss (see the sketch below):

E_hl(a, b, n) = [m + E_bi(a, b) − E_bi(a, n)]_+

where [x]_+ = max(x, 0)
      (a, b) ∈ C : a positive sample
      (a, n) : a negative (noise) sample, with n randomly drawn from the corpus

  • The final objective function:

minimise J(θ) = ∑_{(a,b)∈C} ∑_{i=1}^{k} E_hl(a, b, n_i) + (λ/2) ‖θ‖²

where θ : all the parameters in the model
      n_i : one of k noise samples per positive pair
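
A numpy sketch of the hinge loss, using the ADD composition from the next slide; the margin and the inputs are illustrative assumptions.

```python
import numpy as np

# Sketch of E_hl(a, b, n) = [m + E_bi(a, b) - E_bi(a, n)]_+ with
# E_bi(a, b) = ||f(a) - g(b)||^2 and the ADD composition f(x) = sum of
# word vectors. Sentences are (n_words, dim) arrays of word vectors.
def compose_add(word_vecs):
    return np.sum(word_vecs, axis=0)

def e_bi(sent_a, sent_b):
    diff = compose_add(sent_a) - compose_add(sent_b)
    return np.sum(diff ** 2)            # squared Euclidean distance

def e_hl(a, b, n, margin=1.0):
    return max(0.0, margin + e_bi(a, b) - e_bi(a, n))
```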

SLIDE 49

Composition Models: CVM

Focus on composition functions that do not require any syntactic information (both models are sketched below):

  • ADD model:

f(x) = ∑_{i=1}^{n} x_i

    • A sentence is represented by the sum of its word vectors
    • A distributed bag-of-words approach: ignores word order

  • BI model:

f(x) = ∑_{i=2}^{n} tanh(x_{i−1} + x_i)

    • Captures bigram information
    • A non-linear composition function
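
Both CVMs fit in a couple of numpy lines; a sketch with x as an (n, d) matrix of word vectors.

```python
import numpy as np

# Sketch of the two composition functions over x, an (n, d) matrix
# of word vectors (one row per word).
def add_model(x):
    return x.sum(axis=0)                        # f(x) = sum_i x_i

def bi_model(x):
    return np.tanh(x[:-1] + x[1:]).sum(axis=0)  # f(x) = sum_i tanh(x_{i-1} + x_i)
```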
SLIDE 50

Document-level Semantics

  • For a number of tasks, such as topic modelling, representations of objects beyond the sentence level are required.
  • Extension to document-level learning: recursively apply the composition and objective function

SLIDE 52

Experiment settings

Dataset:

  • The Europarl corpus v7 15 (RCV)
    • used for the Cross-Lingual Document Classification (CLDC) task
    • the English-German and English-French language pairs are considered
  • A massively multilingual corpus (TED)
    • based on the TED corpus 16 for IWSLT 2013
    • training: 12,078 parallel documents (12 languages)
    • used for the topic classification task, with the 15 most frequent keywords as topics

Experiment protocols:

  • All model weights were randomly initialised using a Gaussian distribution (µ = 0, σ² = 0.1)
  • The number of noise samples for each positive sample: {1, 10, 50}
  • The dimension of all embeddings: d = 128
  • Iterations: 100 for RCV, 500 for TED, 5 for joint training

15 http://www.statmt.org/europarl/
16 https://wit3.fbk.eu/

SLIDE 53

RCV1/RCV2 Document Classification

  • ADD: trained on 500k sentence pairs of the English-German parallel section
  • ADD+: uses an additional 500k parallel sentences from the English-French corpus
  • Training the document classifier: on varying training-set sizes between 100 and 10,000 documents

SLIDE 54

TED Corpus Experiments

  • Using the training data of the corpus to learn distributed representations across 12 languages
  • In the single mode: vectors are learnt from a single language pair (en-X)
  • In the joint mode: vector learning is performed on all parallel sub-corpora simultaneously

SLIDE 55

Linguistic Analysis

SLIDE 57

Summary

Mikolov, ICML’14

  • Unsupervised learning of paragraph vectors
    • PV-DM
    • PV-DBOW
  • Learning to predict the surrounding words in contexts sampled from the paragraph
  • PV-DBOW loses the word order information
  • NLP tasks:
    • Sentiment prediction (Stanford, IMDB)
    • Information retrieval (computing similarity between snippets)

SLIDE 58

Summary (cont.)

Kalchbrenner, ACL’14

  • A dynamic convolutional neural network (DCNN)
    • Wide convolution + folding + (dynamic) k-max pooling + non-linearity
  • NLP tasks:
    • Sentiment prediction (Stanford, Twitter)
    • Question type classification
    • Visualising feature detectors
SLIDE 59

Summary (cont.)

Hermann, ACL’14

  • A novel method for learning multilingual word embeddings
    • Leveraging parallel data
    • Defining a multilingual objective function
    • Coupled with simple composition functions
    • CVM & DocCVM: ADD, BI
  • NLP tasks:
    • Cross-lingual document classification (Reuters RCV1/RCV2)
    • Topic classification (TED)

ALL three (Mikolov’14, Kalchbrenner’14, Hermann’14): without requiring external features as provided by parsers or other resources.

SLIDE 60

Related Neural Sentence Models

Neural Bag-of-Words (NBoW) models

  • Mikolov T. et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013
  • Bengio Y. et al. A Neural Probabilistic Language Model. JMLR, 2003

Models that adopt a more general structure

  • Socher R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013
  • Socher R. et al. Grounded Compositional Semantics for Finding and Describing Images with Sentences. TACL, 2013
  • Jordan B. Pollack. Recursive distributed representations. Artificial Intelligence, 1990

Models based on convolution and TDNN architecture

  • Kalchbrenner N. and Blunsom P. Recurrent Convolutional Neural Networks for Discourse Compositionality. ACL, 2013
  • Collobert R. and Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML, 2008

SLIDE 61

Thank You for Listening! Q & A

SLIDE 62

A Neural Probabilistic Language Model 17 18

y = b + W x + U tanh(d + H x)
x = (C(w_{t−1}), C(w_{t−2}), ..., C(w_{t−n+1}))

17 Bengio Y. et al. A Neural Probabilistic Language Model. JMLR, 2003
18 Word Vector: see slide 9
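
A numpy sketch of the NPLM scoring equation; all sizes are illustrative assumptions, and `dvec` names the bias vector d to avoid clashing with other symbols.

```python
import numpy as np

# Sketch of y = b + W x + U tanh(d + H x), where x concatenates the
# embeddings of the previous n-1 words.
V, emb, n, h = 1000, 30, 4, 50
rng = np.random.default_rng(0)
C = rng.normal(0, 0.1, (V, emb))            # embedding table
x_dim = (n - 1) * emb
W = rng.normal(0, 0.1, (V, x_dim))          # direct (linear) connections
U = rng.normal(0, 0.1, (V, h))
H = rng.normal(0, 0.1, (h, x_dim))
b, dvec = np.zeros(V), np.zeros(h)          # dvec plays the role of d

context = [5, 77, 311]                      # w_{t-1}, w_{t-2}, w_{t-3}
x = np.concatenate([C[w] for w in context])
y = b + W @ x + U @ np.tanh(dvec + H @ x)   # unnormalised scores over V words
```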

SLIDE 63

Stanford Sentiment Treebank 19

19Socher R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013