SLIDE 1

AMMI – Introduction to Deep Learning 11.3. Word embeddings and translation

François Fleuret https://fleuret.org/ammi-2018/ November 2, 2018

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

Word embeddings and CBOW


SLIDE 3-4

An important application domain for machine intelligence is Natural Language Processing (NLP).

  • Speech and (hand)writing recognition,
  • auto-captioning,
  • part-of-speech tagging,
  • sentiment prediction,
  • translation,
  • question answering.

While language modeling was historically addressed with formal methods, in particular generative grammars, state-of-the-art and deployed methods are now heavily based on statistical learning and deep learning.

SLIDE 5

A core difficulty of Natural Language Processing is to devise a proper density model for sequences of words. However, since a vocabulary is usually of the order of $10^4$–$10^6$ words, empirical distributions cannot be estimated for more than triplets of words.


SLIDE 6-8

The standard strategy to mitigate this problem is to embed words into a geometrical space to take advantage of data regularities for further [statistical] modeling.

The geometry after embedding should account for synonymy, but also for identical word classes, etc. E.g., we would like such an embedding to make “cat” and “tiger” close, but also “red” and “blue”, or “eat” and “work”, etc.

Even though they are not “deep”, classical word embedding models are key elements of NLP with deep learning.


SLIDE 9-10

Let $k_t \in \{1, \dots, W\}$, $t = 1, \dots, T$, be a training sequence of $T$ words, encoded as IDs through a vocabulary of $W$ words.

Given an embedding dimension $D$, the objective is to learn vectors $E_k \in \mathbb{R}^D$, $k \in \{1, \dots, W\}$, so that “similar” words are embedded with “similar” vectors.


SLIDE 11-12

A common word embedding is the Continuous Bag of Words (CBOW) version of word2vec (Mikolov et al., 2013a).

In this model, the embedding vectors are chosen so that a word can be predicted from [a linear function of] the sum of the embeddings of the words around it.

SLIDE 13

More formally, let $C \in \mathbb{N}^*$ be a “context size”, and

$$\mathcal{C}_t = (k_{t-C}, \dots, k_{t-1}, k_{t+1}, \dots, k_{t+C})$$

be the “context” around $k_t$, that is the indexes of the $2C$ words around it.

[Diagram: in the sequence $k_1, \dots, k_T$, the context $\mathcal{C}_t$ consists of the $C$ words before and the $C$ words after $k_t$.]

SLIDE 14

The embedding vectors $E_k \in \mathbb{R}^D$, $k = 1, \dots, W$, are optimized jointly with an array $M \in \mathbb{R}^{W \times D}$ so that the predicted vector of $W$ scores

$$\psi(t) = M \sum_{k \in \mathcal{C}_t} E_k$$

is a good predictor of the value of $k_t$.

SLIDE 15

Ideally we would minimize the cross-entropy between the vector of scores $\psi(t) \in \mathbb{R}^W$ and the class $k_t$:

$$\sum_t -\log \frac{\exp \psi(t)_{k_t}}{\sum_{k=1}^{W} \exp \psi(t)_k}.$$

However, given the vocabulary size, doing so is numerically unstable and computationally demanding.

SLIDE 16

The “negative sampling” approach uses a loss estimated on the prediction for the correct class $k_t$ and only $Q \ll W$ incorrect classes $\kappa_{t,1}, \dots, \kappa_{t,Q}$ sampled at random. In our implementation we take the latter uniformly in $\{1, \dots, W\}$ and use the same loss as Mikolov et al. (2013b):

$$\sum_t \left[ \log\left(1 + e^{-\psi(t)_{k_t}}\right) + \sum_{q=1}^{Q} \log\left(1 + e^{\psi(t)_{\kappa_{t,q}}}\right) \right].$$

We want $\psi(t)_{k_t}$ to be large and all the $\psi(t)_{\kappa_{t,q}}$ to be small.

SLIDE 17

Although the operation $x \mapsto E_x$ could be implemented as the product between a one-hot vector and a matrix, it is far more efficient to use an actual lookup table.
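A quick standalone check of this equivalence (not from the lecture's code; the sizes are illustrative):

import torch
import torch.nn.functional as F

W, D = 10, 3
E = torch.randn(W, D)                                   # embedding matrix, one row per word
x = torch.tensor([2, 5, 5])                             # word IDs

y_matmul = F.one_hot(x, num_classes = W).float() @ E    # one-hot vectors times the matrix
y_lookup = E[x]                                         # direct row lookup

print(torch.allclose(y_matmul, y_lookup))               # True; the lookup avoids building size-W one-hot vectors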

SLIDE 18

The PyTorch module nn.Embedding does precisely that. It is parametrized with a number $N$ of words to embed and an embedding dimension $D$. It gets as input an integer tensor of arbitrary dimension $A_1 \times \dots \times A_U$, containing values in $\{0, \dots, N-1\}$, and it returns a float tensor of dimension $A_1 \times \dots \times A_U \times D$. If $w$ are the embedding vectors, $x$ the input tensor, and $y$ the result, we have $y[a_1, \dots, a_U, d] = w[x[a_1, \dots, a_U]][d]$.

SLIDE 19

>>> e = nn.Embedding(10, 3)
>>> x = torch.tensor([[1, 1, 2, 2], [0, 1, 9, 9]], dtype = torch.int64)
>>> e(x)
tensor([[[ 0.0386, -0.5513, -0.7518],
         [ 0.0386, -0.5513, -0.7518],
         [-0.4033,  0.6810,  0.1060],
         [-0.4033,  0.6810,  0.1060]],

        [[-0.5543, -1.6952,  1.2366],
         [ 0.0386, -0.5513, -0.7518],
         [ 0.2793, -0.9632,  1.6280],
         [ 0.2793, -0.9632,  1.6280]]])

SLIDE 20

Our CBOW model has as parameters two embeddings $E \in \mathbb{R}^{W \times D}$ and $M \in \mathbb{R}^{W \times D}$. Its forward gets as input a pair of integer tensors corresponding to a batch of size B:

  • c of size B × 2C contains the IDs of the words in a context, and
  • d of size B × R contains the IDs, for each of the B contexts, of the R words for which we want the prediction score (that will be the correct one and Q negative ones).

It returns a tensor y of size B × R containing the dot products

$$y[n, j] = \frac{1}{D} \, M_{d[n,j]} \cdot \left( \sum_i E_{c[n,i]} \right).$$

SLIDE 21

class CBOW(nn.Module):
    def __init__(self, voc_size = 0, embed_dim = 0):
        super(CBOW, self).__init__()
        self.embed_dim = embed_dim
        self.embed_E = nn.Embedding(voc_size, embed_dim)   # E: context word embeddings
        self.embed_M = nn.Embedding(voc_size, embed_dim)   # M: output word embeddings

    def forward(self, c, d):
        # c: B x 2C context word IDs, d: B x R candidate word IDs
        sum_w_E = self.embed_E(c).sum(1).unsqueeze(1).transpose(1, 2)   # B x D x 1, sum of context embeddings
        w_M = self.embed_M(d)                                           # B x R x D
        return w_M.matmul(sum_w_E).squeeze(2) / self.embed_dim          # B x R scores
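A minimal usage sketch of this module (the sizes below are illustrative only, not from the lecture):

import torch

model = CBOW(voc_size = 1000, embed_dim = 50)
c = torch.randint(1000, (16, 4))     # batch of 16 contexts, 2C = 4 word IDs each
d = torch.randint(1000, (16, 6))     # per context: 1 correct word ID and Q = 5 negative ones
y = model(c, d)
print(y.size())                      # torch.Size([16, 6]), one score per candidate word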

SLIDE 22

Regarding the loss, we can use nn.BCEWithLogitsLoss, which implements

$$\sum_t y_t \log(1 + \exp(-x_t)) + (1 - y_t) \log(1 + \exp(x_t)).$$

It takes care in particular of the numerical problem that may arise for large values of $x_t$ if implemented “naively”.
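A quick standalone illustration of that numerical issue (the values are arbitrary):

import torch
from torch import nn

x = torch.tensor([100.0])      # a large logit
y = torch.tensor([0.0])        # target 0: the loss should be log(1 + exp(100)), about 100

naive = y * torch.log(1 + torch.exp(-x)) + (1 - y) * torch.log(1 + torch.exp(x))
stable = nn.BCEWithLogitsLoss()(x, y)

print(naive.item(), stable.item())   # inf (exp(100) overflows) vs ~100.0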

SLIDE 23

Before training a model, we need to prepare data tensors of word IDs from a text file. We will use a 100MB text file taken from Wikipedia and

  • make it lower-case,
  • remove all non-letter characters,
  • replace all words that appear less than 100 times with ’*’,
  • associate a unique ID to each word.

From the resulting sequence of length T stored in an integer tensor, and the context size C, we will generate mini-batches, each made of two tensors:

  • a ’context’ integer tensor c of dimension B × 2C, and
  • a ’word’ integer tensor w of dimension B.
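A minimal sketch of this preprocessing, assuming the corpus fits in memory (the function and variable names are illustrative, not the lecture's actual code):

import re
import torch

def build_id_sequence(filename, min_count = 100):
    text = open(filename, 'r').read().lower()                          # lower-case
    words = re.sub(r'[^a-z]+', ' ', text).split()                      # keep letters only
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    words = [w if counts[w] >= min_count else '*' for w in words]      # rare words become '*'
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}           # word -> unique ID
    id_seq = torch.tensor([vocab[w] for w in words], dtype = torch.int64)
    return id_seq, vocab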


SLIDE 24-25

If the corpus is “The black cat plays with the black ball.”, we will get the following word IDs: the: 0, black: 1, cat: 2, plays: 3, with: 4, ball: 5. The corpus will be encoded as

the black cat plays with the black ball
 0    1    2    3     4    0    1     5

and the data and label tensors will be

Words                        IDs          c           w
the black cat plays with     0 1 2 3 4    0, 1, 3, 4  2
black cat plays with the     1 2 3 4 0    1, 2, 4, 0  3
cat plays with the black     2 3 4 0 1    2, 3, 0, 1  4
plays with the black ball    3 4 0 1 5    3, 4, 1, 5  0

SLIDE 26

We can train the model for an epoch with:

for k in range(0, id_seq.size(0) - 2 * context_size - batch_size, batch_size):
    c, w = extract_batch(id_seq, k, batch_size, context_size)
    d = torch.empty(w.size(0), 1 + nb_neg_samples, dtype = torch.int64)
    d.random_(voc_size)           # negative samples drawn uniformly in {0, ..., voc_size - 1}
    d[:, 0] = w                   # column 0 holds the correct word
    target = torch.empty(d.size())
    target.narrow(1, 0, 1).fill_(1)
    target.narrow(1, 1, nb_neg_samples).fill_(0)
    output = model(c, d)
    loss = bce_loss(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
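The helper extract_batch is not shown in the slides; a minimal sketch consistent with the shapes used above (a sliding window of width 2C + 1 over the ID sequence) could be:

import torch

def extract_batch(id_seq, k, batch_size, context_size):
    # windows of 2C + 1 consecutive IDs starting at positions k, k + 1, ..., k + batch_size - 1
    windows = id_seq[k : k + batch_size + 2 * context_size].unfold(0, 2 * context_size + 1, 1)
    c = torch.cat((windows[:, :context_size], windows[:, context_size + 1:]), 1)   # B x 2C contexts
    w = windows[:, context_size]                                                   # B center words
    return c, w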

SLIDE 27

Some nearest neighbors for the cosine distance between the embeddings

$$d(w, w') = \frac{E_w \cdot E_{w'}}{\|E_w\| \, \|E_{w'}\|}.$$

paris             bike              cat             fortress              powerful
parisian    0.61  bicycle     0.61  cats      0.55  fortresses      0.61  formidable  0.47
france      0.59  bicycles    0.51  dog       0.54  citadel         0.55  power       0.44
brussels    0.55  bikes       0.51  kitten    0.49  castle          0.55  potent      0.44
bordeaux    0.53  biking      0.49  feline    0.44  fortifications  0.52  fearsome    0.40
toulouse    0.51  motorcycle  0.47  pet       0.42  forts           0.51  destroy     0.40
vienna      0.51  cyclists    0.43  dogs      0.40  siege           0.50  wielded     0.39
strasbourg  0.51  riders      0.42  kittens   0.40  stronghold      0.49  versatile   0.38
munich      0.49  sled        0.41  hound     0.39  castles         0.49  capable     0.38
marseille   0.49  triathlon   0.41  squirrel  0.39  monastery       0.48  strongest   0.38
rouen       0.48  car         0.41  mouse     0.38  besieged        0.48  able        0.37
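A minimal sketch of how such neighbors could be computed from the trained module (assuming model is the CBOW module trained above and vocab the word-to-ID mapping from the preprocessing sketch; the names are illustrative):

import torch

E = model.embed_E.weight.detach()
E = E / E.norm(dim = 1, keepdim = True)                 # unit-norm rows, so dot product = cosine
id2word = {i: w for w, i in vocab.items()}

def nearest(word, n = 10):
    sim = E @ E[vocab[word]]                            # cosine similarity to every word
    best = sim.argsort(descending = True)[1 : n + 1]    # skip the word itself
    return [(id2word[int(i)], round(float(sim[i]), 2)) for i in best]

print(nearest('paris'))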

SLIDE 28

An alternative algorithm is the skip-gram model, which optimizes the embedding so that a word can be predicted by any individual word in its context (Mikolov et al., 2013a).

[Figure: the two word2vec architectures. Left, CBOW: the context words w(t−2), w(t−1), w(t+1), w(t+2) are projected, summed, and used to predict w(t). Right, Skip-gram: w(t) is projected and used to predict each of w(t−2), w(t−1), w(t+1), w(t+2).]

(Mikolov et al., 2013a)


SLIDE 29-30

Trained on large corpora, such models reflect semantic relations in the linear structure of the embedding space. E.g.

E[paris] − E[france] + E[italy] ≃ E[rome]

Table 8: Examples of the word pair relationships, using the best word vectors from Table 4 (Skip-gram model trained on 783M words with 300 dimensionality).

Relationship           Example 1             Example 2           Example 3
France - Paris         Italy: Rome           Japan: Tokyo        Florida: Tallahassee
big - bigger           small: larger         cold: colder        quick: quicker
Miami - Florida        Baltimore: Maryland   Dallas: Texas       Kona: Hawaii
Einstein - scientist   Messi: midfielder     Mozart: violinist   Picasso: painter
Sarkozy - France       Berlusconi: Italy     Merkel: Germany     Koizumi: Japan
copper - Cu            zinc: Zn              gold: Au            uranium: plutonium
Berlusconi - Silvio    Sarkozy: Nicolas      Putin: Medvedev     Obama: Barack
Microsoft - Windows    Google: Android       IBM: Linux          Apple: iPhone
Microsoft - Ballmer    Google: Yahoo         IBM: McNealy        Apple: Jobs
Japan - sushi          Germany: bratwurst    France: tapas       USA: pizza

(Mikolov et al., 2013a)
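Such analogies amount to a nearest-neighbor search around a combined vector; a minimal sketch, reusing E, vocab and id2word from the cosine-similarity sketch above:

def analogy(a, b, c, n = 5):
    q = E[vocab[a]] - E[vocab[b]] + E[vocab[c]]      # e.g. paris - france + italy
    sim = E @ (q / q.norm())
    best = sim.argsort(descending = True)[:n]
    return [id2word[int(i)] for i in best]

print(analogy('paris', 'france', 'italy'))           # ideally 'rome' appears among the top answers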

SLIDE 31

The main benefit of word embeddings is that they are trained from unannotated corpora, hence possibly extremely large ones. This modeling can then be leveraged for small-corpora tasks such as

  • sentiment analysis,
  • question answering,
  • topic classification,
  • etc.
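For instance, the learned vectors can initialize a frozen lookup table in a small task-specific model. A minimal sketch, assuming model is the CBOW module trained above (the classification head is purely illustrative):

from torch import nn

E_pre = model.embed_E.weight.detach()                          # W x D vectors learned without labels
embed = nn.Embedding.from_pretrained(E_pre, freeze = True)     # frozen pretrained lookup table
head = nn.Linear(E_pre.size(1), 2)                             # e.g. a 2-class sentiment classifier

def predict(word_ids):                                         # word_ids: 1D tensor of token IDs
    return head(embed(word_ids).mean(0))                       # average the word vectors, then classify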

SLIDE 32

Sequence-to-sequence translation

SLIDE 33

Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.

The main result of this work is the following. On the WMT’14 English to French translation task, [...]

(Sutskever et al., 2014)

SLIDE 34

English to French translation. Training:

  • corpus of 12M sentences, 348M French words, 30M English words,
  • LSTM with 4 layers, one for encoding, one for decoding,
  • 160,000-word input vocabulary, 80,000-word output vocabulary,
  • 1,000-dimension word embedding, 384M parameters in total,
  • input sentence is reversed,
  • gradient clipping.

The hidden state that contains the information to generate the translation is of dimension 8,000. Inference is done with a “beam search”, which consists of greedily increasing the size of the predicted sequence while keeping a bag of the K best ones.
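A minimal sketch of that beam procedure (next_logprobs is a hypothetical function returning (token, log-probability) pairs for the possible next tokens given a prefix; bos and eos are the start and end-of-sentence tokens; K is the beam width):

def beam_search(next_logprobs, bos, eos, K = 12, max_len = 50):
    beam = [([bos], 0.0)]                                  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix[-1] == eos:                          # finished hypotheses are kept as-is
                candidates.append((prefix, score))
            else:
                for tok, lp in next_logprobs(prefix):
                    candidates.append((prefix + [tok], score + lp))
        beam = sorted(candidates, key = lambda c: c[1], reverse = True)[:K]   # keep the K best
        if all(p[-1] == eos for p, _ in beam):
            break
    return beam[0][0]                                      # best-scoring sequence of tokens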

SLIDE 35

Comparing a produced sentence to a reference one is complex, since it is related to their semantic content. A widely used measure is the BLEU score, which counts the fraction of groups of one, two, three, and four words (aka “n-grams”) from the generated sentence that appear in the reference translations (Papineni et al., 2002). The exact definition is complex, and the validity of this score is disputable since it poorly accounts for semantics.
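A rough sketch of the n-gram precision at the heart of this score (ignoring the clipping and brevity-penalty terms of the full definition):

def ngram_precision(candidate, reference, n):
    # fraction of the candidate's n-grams that also appear in the reference
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = {tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)}
    return sum(g in ref for g in cand) / max(len(cand), 1)

candidate = 'the cat sat on the mat'.split()
reference = 'the cat is on the mat'.split()
print([round(ngram_precision(candidate, reference, n), 2) for n in (1, 2, 3, 4)])   # [0.83, 0.6, 0.25, 0.0]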

SLIDE 36

Method                                        test BLEU score (ntst14)
Bahdanau et al. [2]                           28.45
Baseline System [29]                          33.30
Single forward LSTM, beam size 12             26.17
Single reversed LSTM, beam size 12            30.59
Ensemble of 5 reversed LSTMs, beam size 1     33.00
Ensemble of 2 reversed LSTMs, beam size 12    33.27
Ensemble of 5 reversed LSTMs, beam size 2     34.50
Ensemble of 5 reversed LSTMs, beam size 12    34.81

Table 1: The performance of the LSTM on the WMT’14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

(Sutskever et al., 2014)

SLIDE 37

Our model: Ulrich UNK , membre du conseil d’ administration du constructeur automobile Audi , affirme qu’ il s’ agit d’ une pratique courante depuis des années pour que les téléphones portables puissent être collectés avant les réunions du conseil d’ administration afin qu’ ils ne soient pas utilisés comme appareils d’ écoute à distance .

Truth: Ulrich Hackenberg , membre du conseil d’ administration du constructeur automobile Audi , déclare que la collecte des téléphones portables avant les réunions du conseil , afin qu’ ils ne puissent pas être utilisés comme appareils d’ écoute à distance , est une pratique courante depuis des années .

Our model: “ Les téléphones cellulaires , qui sont vraiment une question , non seulement parce qu’ ils pourraient potentiellement causer des interférences avec les appareils de navigation , mais nous savons , selon la FCC , qu’ ils pourraient interférer avec les tours de téléphone cellulaire lorsqu’ ils sont dans l’ air ” , dit UNK .

Truth: “ Les téléphones portables sont véritablement un problème , non seulement parce qu’ ils pourraient éventuellement créer des interférences avec les instruments de navigation , mais parce que nous savons , d’ après la FCC , qu’ ils pourraient perturber les antennes-relais de téléphonie mobile s’ ils sont utilisés à bord ” , a déclaré Rosenker .

Our model: Avec la crémation , il y a un “ sentiment de violence contre le corps d’ un être cher ” , qui sera “ réduit à une pile de cendres ” en très peu de temps au lieu d’ un processus de décomposition “ qui accompagnera les étapes du deuil ” .

Truth: Il y a , avec la crémation , “ une violence faite au corps aimé ” , qui va être “ réduit à un tas de cendres ” en très peu de temps , et non après un processus de décomposition , qui “ accompagnerait les phases du deuil ” .

Table 3: A few examples of long translations produced by the LSTM alongside the ground truth translations. The reader can verify that the translations are sensible using Google translate.

SLIDE 38

[Figure: two plots of BLEU score (20–40) for the LSTM (34.8) and the baseline (33.3); left, over test sentences sorted by their length; right, over test sentences sorted by average word frequency rank.]

Figure 3: The left plot shows the performance of our system as a function of sentence length, where the x-axis corresponds to the test sentences sorted by their length and is marked by the actual sequence lengths. There is no degradation on sentences with less than 35 words, there is only a minor degradation on the longest sentences. The right plot shows the LSTM’s performance on sentences with progressively more rare words, where the x-axis corresponds to the test sentences sorted by their “average word frequency rank”.

SLIDE 39

[Figure: 2-dimensional PCA projections of LSTM hidden states for two groups of phrases. Left: “John respects Mary”, “Mary respects John”, “John admires Mary”, “Mary admires John”, “Mary is in love with John”, “John is in love with Mary”. Right: “I gave her a card in the garden”, “In the garden , I gave her a card”, “She was given a card by me in the garden”, “She gave me a card in the garden”, “In the garden , she gave me a card”, “I was given a card by her in the garden”.]

Figure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained after processing the phrases in the figures. The phrases are clustered by meaning, which in these examples is primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.

(Sutskever et al., 2014)

SLIDE 40

The end

SLIDE 41

References

  • T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013a.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS), 2013b.
  • K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318. Association for Computational Linguistics, 2002.
  • I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.