Distributed Representation of Sentences
LU Yangyang
luyy11@sei.pku.edu.cn
July 16, 2014 @ KERE Seminar
Outline
1. Distributed Representation of Sentences and Documents. ICML’14
   - Word Vector
   - Paragraph Vector
   - Experiments on NLP Tasks
2. A Convolutional Neural Network for Modelling Sentences. ACL’14
   - DCNN: Convolutional Neural Networks
   - Experiments on NLP Tasks
3. Multilingual Models for Compositional Distributed Semantics. ACL’14
   - Composition Models
   - Experiments
4. Summary
5. Appendix
1 http://icml.cc/2014/index/article/15.htm
2 http://acl2014.org/acl2014/index.html
Mikolov, ICML’14: Distributed Representation of Sentences and Documents
Word Vector
Every word is mapped to a unique vector, represented by a column in a word matrix W.
Given a sequence of training words w1, w2, w3, ..., wT, the objective is to maximize the average log probability of predicting nearby words.
3 Mikolov T., et al. Efficient estimation of word representations in vector space. ICLR Workshop, 2013
The Skip-gram Model 4
Use the current word to predict the surrounding words in a sentence.
maximize (1/T) ∑_{t=1}^{T} ∑_{−c ≤ j ≤ c, j ≠ 0} log p(wt+j | wt)
where c : the size of the training context
4 Mikolov T., et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013
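As an illustration, a minimal sketch of the skip-gram objective above, assuming a toy vocabulary, randomly initialised input/output vectors, and an exact softmax (the papers use hierarchical softmax or negative sampling instead):

```python
import numpy as np

def skipgram_log_prob(center, context, W_in, W_out):
    """log p(w_context | w_center) with an exact softmax over the vocabulary."""
    scores = W_out @ W_in[center]            # one score per vocabulary word
    scores -= scores.max()                   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return log_probs[context]

def skipgram_objective(words, W_in, W_out, c=2):
    """(1/T) sum_t sum_{-c<=j<=c, j!=0} log p(w_{t+j} | w_t)"""
    T = len(words)
    total = 0.0
    for t in range(T):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                total += skipgram_log_prob(words[t], words[t + j], W_in, W_out)
    return total / T

# toy usage: 10-word vocabulary, 50-dimensional vectors, one short "sentence" of word ids
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(10, 50)), rng.normal(size=(10, 50))
print(skipgram_objective([0, 3, 7, 2, 5, 1], W_in, W_out, c=2))
```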
Continuous Bag-of-Words Model (CBOW) 5
Use a context of surrounding words to predict the current word; both the word vectors and the softmax weights are learned (not just the projection matrix).
maximize (1/T) ∑_{t=k}^{T−k} log p(wt | wt−k, ..., wt+k)
5 Mikolov T., et al. Efficient estimation of word representations in vector space. ICLR Workshop, 2013
maximize (1/T) ∑_{t=k}^{T−k} log p(wt | wt−k, ..., wt+k)
p(wt | wt−k, ..., wt+k) = e^{y_wt} / ∑_i e^{y_i}
y = b + Uh(wt−k, ..., wt+k; W)
where U, b : the softmax parameters
h : a concatenation or average of word vectors extracted from W
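A minimal sketch of the prediction step above, assuming h is the average of the context word vectors (the framework also allows concatenation); W, U and b are random toy parameters here:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 50                     # toy vocabulary size and vector dimension
W = rng.normal(size=(V, d))       # word vector matrix (one row per word)
U = rng.normal(size=(V, d))       # softmax weights
b = np.zeros(V)                   # softmax bias

def predict_word_probs(context_ids):
    """p(w | w_{t-k}, ..., w_{t+k}) with h = average of the context word vectors."""
    h = W[context_ids].mean(axis=0)        # h(w_{t-k}, ..., w_{t+k}; W)
    y = b + U @ h                          # y = b + U h(...)
    y -= y.max()                           # numerical stability
    return np.exp(y) / np.exp(y).sum()     # softmax

probs = predict_word_probs([1, 4, 7, 2])   # ids of the words around position t
print(probs.argmax(), round(probs.sum(), 6))
```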
Paragraph Vector
PV-DM: A Distributed Memory Model
The paragraph token acts as a memory that remembers what is missing from the current context, or the topic of the paragraph.
The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph, but not across paragraphs.
The only change compared to the word vector model:
y = b + Uh(wt−k, ..., wt+k, d; W, D)
where h : constructed from W and D
d : the vector of the paragraph from which the context is sampled
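Continuing the previous sketch, PV-DM only changes how h is built: the paragraph vector is combined with the context word vectors. D below is a hypothetical matrix of paragraph vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
V, P, d = 10, 3, 50                 # toy vocabulary size, number of paragraphs, dimension
W = rng.normal(size=(V, d))         # word vectors
D = rng.normal(size=(P, d))         # paragraph vectors
U = rng.normal(size=(V, d))
b = np.zeros(V)

def pv_dm_probs(paragraph_id, context_ids):
    """p(w | context, paragraph): h is constructed from both W and D (averaged here)."""
    h = np.vstack([D[paragraph_id], W[context_ids]]).mean(axis=0)
    y = b + U @ h
    y -= y.max()
    return np.exp(y) / np.exp(y).sum()

print(pv_dm_probs(0, [1, 4, 7, 2]).argmax())
```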
PV-DBOW: Distributed Bag-Of-Words 7
Ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output.
7 Analogous to the Skip-gram model above
Experiments on NLP Tasks
Stanford Sentiment Treebank Dataset 8
Dataset: movie review sentences labelled for sentiment; fine-grained (++ / + / 0 / − / −−) and binary coarse-grained (pos/neg) classification.
8 Socher R., et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013
IMDB Dataset 9
Dataset: 100,000 IMDB movie reviews (25,000 labeled for training, 25,000 labeled for testing, 50,000 unlabeled).
Experimental protocol: learn word and paragraph vectors on the 75,000 training reviews (25,000 labeled + 50,000 unlabeled); the paragraph vectors are then fed to a neural network and a logistic classifier to predict sentiment.
9 Maas A., et al. Learning word vectors for sentiment analysis. ACL, 2011
Information retrieval task. Dataset: triples of paragraphs (snippets of search-engine results); in each triple, two snippets come from the same query and the third from a different query, and the task is to identify the odd one out.
Kalchbrenner, ACL’14: A Convolutional Neural Network for Modelling Sentences
DCNN: Convolutional Neural Networks
Background: the (Max-)TDNN sentence model 10
The sentence is processed in a sequential fashion: at time t ≥ 1, one sees xt, the t-th word in the sentence, so the convolution is sensitive to the notion of word order.
10 Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML, 2008
Convolutional Neural Networks with Dynamic k-Max Pooling
k-Max Pooling 11 12
Given a value k and a sequence p of length s ≥ k, k-max pooling selects the subsequence p^k_max of the k highest values of p. The order of the values in p^k_max corresponds to their original order in p.
Dynamic k-Max Pooling:
kl = max(ktop, ⌈((L − l) / L) · s⌉)
where l : the number of the current convolutional layer to which the pooling is applied
L : the total number of convolutional layers in the network
ktop : the fixed pooling parameter for the topmost convolutional layer
s : the length of the input sentence
11 Max-TDNN: Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML, 2008
12 A convolution network for object recognition: Yann LeCun, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998
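A minimal sketch of both operations, assuming a single row of a feature map represented as a NumPy array:

```python
import math
import numpy as np

def k_max_pooling(p, k):
    """Keep the k highest values of p, preserving their original order."""
    idx = np.sort(np.argpartition(p, -k)[-k:])   # indices of the k largest, re-sorted
    return p[idx]

def dynamic_k(l, L, k_top, s):
    """k_l = max(k_top, ceil((L - l) / L * s)) for convolutional layer l out of L."""
    return max(k_top, math.ceil((L - l) / L * s))

p = np.array([1.0, 5.0, 2.0, 7.0, 3.0, 6.0])
print(k_max_pooling(p, 3))                   # -> [5. 7. 6.] (original order preserved)
print(dynamic_k(l=1, L=3, k_top=3, s=18))    # first of three layers, 18-word sentence -> 12
```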
Wide convolution 13: the weights of the d filters are applied to each d-dimensional column of the sentence matrix, which can be written with a matrix of diagonals
M = [diag(m:,1), ..., diag(m:,m)]
where m : the weights of the d filters of the wide convolution
Wide convolution of the filters with the input sentence matrix → a first-order feature map
13 Temporarily ignoring the pooling layer
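For intuition, a sketch of wide (as opposed to narrow) one-dimensional convolution on a single row: the wide variant produces s + m − 1 values and lets the filter range over the sentence boundaries:

```python
import numpy as np

s_row = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one row of the sentence matrix (s = 5)
m_row = np.array([0.5, 1.0, 0.5])             # one row of a filter (m = 3)

narrow = np.convolve(s_row, m_row, mode="valid")  # s - m + 1 = 3 values
wide = np.convolve(s_row, m_row, mode="full")     # s + m - 1 = 7 values
print(narrow)
print(wide)
```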
Applying a non-linear function → feature maps of increasing order 14
F^i_j = ∑_{k=1}^{n} m^i_{j,k} * F^{i−1}_k
where F^i_j : the j-th feature map of the i-th order
* : wide convolution
m^i_{j,k} : convolving matrix (all the m^i_{j,k} together form an order-4 tensor)
14 LeCun Y., Bengio Y. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 1995
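A small sketch of how one higher-order feature map could be formed from n lower-order maps. This simplifies the DCNN (one 1-D filter per lower-order map, applied row-wise) but shows the sum-of-wide-convolutions structure of the formula above:

```python
import numpy as np

def next_order_feature_map(prev_maps, filters):
    """F^i_j = sum_k m^i_{j,k} * F^{i-1}_k, with * a row-wise wide convolution."""
    total = None
    for F_prev, m in zip(prev_maps, filters):
        # convolve every row of the previous map with the filter (wide / 'full' mode)
        conv = np.stack([np.convolve(row, m, mode="full") for row in F_prev])
        total = conv if total is None else total + conv
    return total

prev_maps = [np.random.rand(4, 6) for _ in range(2)]     # n = 2 maps, d = 4 rows, length 6
filters = [np.random.rand(3) for _ in range(2)]          # one filter per lower-order map
print(next_order_feature_map(prev_maps, filters).shape)  # (4, 8): wide conv gives 6 + 3 - 1
```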
In the formulation of the network so far, feature detectors applied to different rows of a feature map are independent of each other until the top fully connected layer.
Folding: sum every two rows of a feature map component-wise (after a convolutional layer, before pooling). A feature detector of order i now depends on two rows of feature values in the lower maps of order i−1.
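A sketch of the folding operation, assuming a feature map with an even number of rows:

```python
import numpy as np

def fold(feature_map):
    """Sum every two adjacent rows component-wise, halving the number of rows."""
    d, s = feature_map.shape
    assert d % 2 == 0, "folding expects an even number of rows"
    return feature_map[0::2] + feature_map[1::2]

F = np.arange(12.0).reshape(4, 3)   # 4 rows of length 3
print(fold(F))                      # 2 rows: row0 + row1, row2 + row3
```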
Experiments on NLP Tasks
In the experiments, the top layer of the network is a fully connected layer followed by a softmax non-linearity that predicts the probability distribution over classes given the input sentence.
Stanford Sentiment Treebank Dataset
TREC Question Classification Dataset
DCNN configuration: filters of size 8 and 5 feature maps.
Twitter Sentiment Prediction with Distant Supervision
Dataset: 1.6 million tweets automatically labelled via emoticons (distant supervision) for training, and 400 hand-annotated tweets for testing.
Training procedure as in the sentiment prediction task of the Stanford Sentiment Treebank.
Visualising feature detectors: each filter is associated with a feature detector or neuron that learns during training to be particularly active when presented with a specific sequence of input words.
Hermann, ACL’14: Multilingual Models for Compositional Distributed Semantics
Goal: learn distributed representations of sentences and documents in a shared multilingual semantic space, using only sentence-aligned parallel corpora (no word alignments); evaluated on cross-lingual document classification, including a massively multilingual setting derived from the TED corpus.
Composition Models
English-Chinese: (a) The cat sat on the red mat. (b) 猫坐在红色的垫子上。
English-German: (a) The cat sat on the red mat. (b) Die Katze saß auf der roten Matte.
Idea: if the representations of two aligned (parallel) sentences are pushed towards each other, they are forced to capture the common elements between these two sentences.
Define a bilingual energy: Ebi(a, b) = ‖f(a) − g(b)‖²
where C : a parallel corpus
x, y : two different languages
(a, b) ∈ C : two sentences of languages x, y
f : X → R^d, g : Y → R^d
Minimize Ebi for all semantically equivalent sentences in the corpus, using a noise-contrastive hinge loss:
Ehl(a, b, n) = [m + Ebi(a, b) − Ebi(a, n)]+
where [x]+ = max(x, 0)
(a, b) ∈ C : positive sample
(a, n) : negative (or noise) sample, with n a randomly drawn sentence
minimize J(θ) = ∑_{(a,b)∈C} ∑_{i=1}^{k} Ehl(a, b, ni) + (λ/2) ‖θ‖²
where θ : all the parameters in the model
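A minimal sketch of the bilingual energy and the hinge loss, assuming the sentence vectors f(a), g(b) and g(n) have already been composed (random toy vectors below):

```python
import numpy as np

def energy_bi(fa, gb):
    """E_bi(a, b) = ||f(a) - g(b)||^2 for already-composed sentence vectors."""
    return float(np.sum((fa - gb) ** 2))

def hinge_loss(fa, gb, gn, margin=1.0):
    """E_hl(a, b, n) = [m + E_bi(a, b) - E_bi(a, n)]_+"""
    return max(0.0, margin + energy_bi(fa, gb) - energy_bi(fa, gn))

rng = np.random.default_rng(0)
fa, gb, gn = rng.normal(size=(3, 4))   # toy 4-dimensional sentence vectors
print(hinge_loss(fa, gb, gn))
```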
Focus on composition functions that do not require any syntactic information.
ADD: f(x) = ∑_{i=1}^{n} xi
BI (bigram): f(x) = ∑_i tanh(xi−1 + xi)
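A sketch of the two composition functions operating on a matrix of word vectors; the exact bigram indexing below (pairs of consecutive words) is one reasonable reading of the sum over tanh(xi−1 + xi):

```python
import numpy as np

def compose_add(word_vectors):
    """ADD: f(x) = sum_i x_i"""
    return np.sum(word_vectors, axis=0)

def compose_bi(word_vectors):
    """BI: f(x) = sum_i tanh(x_{i-1} + x_i), summed over consecutive word pairs."""
    x = np.asarray(word_vectors)
    return np.tanh(x[:-1] + x[1:]).sum(axis=0)

sentence = np.random.rand(5, 4)   # 5 words, 4-dimensional embeddings
print(compose_add(sentence).shape, compose_bi(sentence).shape)   # both (4,)
```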
Document representations are obtained by recursively applying the composition and objective function to the sentence representations within a document.
Experiments
Dataset: the Europarl parallel corpus 15 (for training the embeddings) and the TED corpus 16 (talks labelled with topics, for cross-lingual document classification).
Experiment protocols: model parameters are initialised from a Gaussian distribution (µ = 0, σ² = 0.1).
15 http://www.statmt.org/europarl/
16 https://wit3.fbk.eu/
Embeddings trained on parallel sentences from the English-French corpus; classifier training sets of varying sizes between 100 and 10,000 documents.
Joint representations learned across 12 languages by training on all (en-X) sub-corpora simultaneously.
Summary
Mikolov, ICML’14
Paragraph vectors, learned by predicting words in contexts sampled from the paragraph; evaluated on sentiment classification and on an information retrieval task (similarity between snippets).
Kalchbrenner, ACL’14
DCNN: wide convolution + (dynamic) k-max pooling + non-linearity (+ folding).
Hermann, ACL’14
Multilingual sentence and document embeddings learned from sentence-aligned parallel corpora; evaluated on cross-lingual document classification (a task originally built on RCV1/RCV2).
ALL (Mikolov’14, Kalchbrenner’14, Hermann’14): without requiring external features as provided by parsers or other resources.
Neural Bag-of-Words (NBoW) models
Models that adopt a more general structure, e.g. recursive neural networks over parse trees:
- Socher R., et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013
- Socher R., et al. Grounded compositional semantics for finding and describing images with sentences. TACL, 2013
Models based on convolution and the TDNN architecture:
- Kalchbrenner N., Blunsom P. Recurrent convolutional neural networks for discourse compositionality. ACL, 2013
- Collobert R., Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML, 2008
Appendix
Neural Probabilistic Language Model 17
y = b + Wx + U tanh(d + Hx)
x = (C(wt−1), C(wt−2), ..., C(wt−n+1))
17 Bengio Y., et al. A Neural Probabilistic Language Model. JMLR, 2006
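For reference, a toy sketch of the scoring equations above, with C the word embedding matrix and x the concatenation of the previous n − 1 word vectors (all parameters are random; the bias vector d is named d_vec to avoid clashing with the embedding dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10, 8, 4, 16             # vocab size, embedding dim, n-gram order, hidden units
C = rng.normal(size=(V, d))           # word embedding matrix
H = rng.normal(size=(h, (n - 1) * d))
U = rng.normal(size=(V, h))
W = rng.normal(size=(V, (n - 1) * d))
b, d_vec = np.zeros(V), np.zeros(h)

def nplm_scores(prev_word_ids):
    """y = b + W x + U tanh(d + H x), with x the concatenated context embeddings."""
    x = np.concatenate([C[w] for w in prev_word_ids])   # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    return b + W @ x + U @ np.tanh(d_vec + H @ x)

y = nplm_scores([3, 1, 7])            # the n - 1 = 3 previous word ids
probs = np.exp(y - y.max()); probs /= probs.sum()
print(probs.argmax())                 # most probable next word under this toy model
```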
19 Socher R., et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013