Word Embeddings. CS 6956: Deep Learning for NLP

  1. Word Embeddings CS 6956: Deep Learning for NLP

  2. Overview • Representing meaning • Word embeddings: Early work • Word embeddings via language models • Word2vec and GloVe • Evaluating embeddings • Design choices and open questions

  4. Word embeddings via language models • The goal: to find vector embeddings of words • High-level approach: 1. Train a model for a surrogate task (in this case, language modeling) 2. Word embeddings are a byproduct of this process

  5. Neural network language models • A multi-layer neural network [Bengio et al., 2003]; context = previous words in the sentence – Words → embedding layer → hidden layers → softmax – Cross-entropy loss • Instead of producing a probability, just produce a score for the next word (no softmax) [Collobert and Weston, 2008] – Ranking loss – Intuition: valid word sequences should get a higher score than invalid ones • No need for a multi-layer network; a shallow network is good enough [Mikolov et al., 2013, word2vec]; context = previous and next words in the sentence – Simpler model, fewer parameters – Faster to train
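A minimal NumPy sketch of the Bengio-style feedforward language model in the first bullet above. The sizes, variable names, and the single tanh hidden layer are illustrative assumptions, not details from the slides:

    import numpy as np

    # Toy sizes (illustrative only): vocabulary of 10 words, 8-dim embeddings,
    # 16 hidden units, context = 3 previous words.
    V, n, h, k = 10, 8, 16, 3
    rng = np.random.default_rng(0)

    C  = rng.normal(scale=0.1, size=(V, n))      # embedding table, one row per word
    W1 = rng.normal(scale=0.1, size=(k * n, h))  # concatenated embeddings -> hidden layer
    W2 = rng.normal(scale=0.1, size=(h, V))      # hidden layer -> scores over the vocabulary

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def nnlm_prob(prev_word_ids):
        """Words -> embedding layer -> hidden layer -> softmax over the next word."""
        x = np.concatenate([C[i] for i in prev_word_ids])   # (k*n,)
        hidden = np.tanh(x @ W1)                             # (h,)
        return softmax(hidden @ W2)                          # (V,)

    p = nnlm_prob([1, 4, 7])     # distribution over the next word
    loss = -np.log(p[2])         # cross-entropy loss if the true next word has id 2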

  6. This lecture • The word2vec models: CBOW and Skipgram • Connection between word2vec and matrix factorization • GloVe

  7. Word2Vec [Mikolov et al., ICLR 2013; Mikolov et al., NIPS 2013] • Two architectures for learning word embeddings – Skipgram and CBOW • Both have two key differences from the older Bengio/C&W approaches: 1. No hidden layers 2. Extra context (both left and right context) • Several computational tricks to make things faster

  9. Continuous Bag of Words (CBOW) • Given a window of words of length 2m + 1; call them $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$ • Define a probabilistic model for predicting the middle word: $P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$ • Train the model by minimizing the loss over the dataset: $L = -\sum \log P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$
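Spelled out, the sum in the loss runs over every center-word position in the training corpus; with $t$ indexing positions (an indexing choice added here for clarity, not shown on the slide):

    L = -\sum_{t=1}^{T} \log P\big(x_t \mid x_{t-m}, \dots, x_{t-1}, x_{t+1}, \dots, x_{t+m}\big)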

  12. To complete the model, we still need to define $P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$.

  13. The CBOW model • The classification task – Input: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$ – Output: the center word $x_0$ – These words correspond to one-hot vectors • E.g., cat would be associated with a dimension; its one-hot vector has 1 in that dimension and zero everywhere else • Notation: – n: the embedding dimension (e.g., 300) – V: the vocabulary of words we want to embed • Define two matrices: 1. $\mathcal{V}$: a matrix of size $n \times |V|$ 2. $\mathcal{W}$: a matrix of size $|V| \times n$
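As a concrete illustration of the one-hot encoding and of multiplying it by the $n \times |V|$ matrix, here is a NumPy sketch; the toy vocabulary, variable names, and sizes are made-up assumptions:

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}   # toy vocabulary, |V| = 5
    n = 3                                                       # toy embedding dimension

    x_cat = np.zeros(len(vocab))
    x_cat[vocab["cat"]] = 1.0        # one-hot vector for "cat": [0, 1, 0, 0, 0]

    V_in = np.random.default_rng(0).normal(size=(n, len(vocab)))  # an n x |V| matrix
    v_cat = V_in @ x_cat             # multiplying by a one-hot selects one column of the matrix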

  14. The CBOW model (Input: context $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$; Output: the center word $x_0$; n: the embedding dimension, e.g. 300; V: the vocabulary; $\mathcal{V}$: a matrix of size $n \times |V|$; $\mathcal{W}$: a matrix of size $|V| \times n$) 1. Map all the context words into the n-dimensional space using $\mathcal{V}$ – We get 2m vectors: $\mathcal{V}x_{-m}, \dots, \mathcal{V}x_{-1}, \mathcal{V}x_1, \dots, \mathcal{V}x_m$ 2. Average these vectors to get a context vector: $\hat{v} = \frac{1}{2m} \sum_{i=-m,\, i \neq 0}^{m} \mathcal{V}x_i$ 3. Use this to compute a score vector for the output: $score = \mathcal{W}\hat{v}$ 4. Use the score to compute a probability via softmax: $P(x_0 = \cdot \mid \text{context}) = \mathrm{softmax}(\mathcal{W}\hat{v})$
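A minimal NumPy sketch of steps 1-4 above. The toy sizes, the variable names V_in and W_out, and the example word ids are illustrative assumptions, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, n, m = 5, 3, 2                            # toy |V|, embedding dim, half-window

    V_in  = rng.normal(scale=0.1, size=(n, vocab_size))   # the n x |V| input matrix
    W_out = rng.normal(scale=0.1, size=(vocab_size, n))   # the |V| x n output matrix

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def cbow_prob(context_ids):
        """Steps 1-4: embed the context words, average, score, softmax."""
        context_vecs = V_in[:, context_ids]   # step 1: one column of V_in per context word
        v_hat = context_vecs.mean(axis=1)     # step 2: average into a single context vector (n,)
        score = W_out @ v_hat                 # step 3: one score per vocabulary word
        return softmax(score)                 # step 4: P(x_0 = . | context)

    p = cbow_prob([0, 1, 3, 4])               # 2m = 4 context word ids
    loss = -np.log(p[2])                      # cross-entropy loss if the true center word has id 2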

  19. Exercise: Write the CBOW model above as a computation graph.

  20. Word embeddings: the rows of the matrix corresponding to the output, that is, the rows of $\mathcal{W}$.
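Continuing the sketch after slide 14 (same assumed variable names), reading an embedding off the trained model would then look like:

    # Row i of W_out (|V| x n) is the n-dimensional embedding of the word with id i.
    cat_embedding = W_out[1]    # e.g., id 1 for "cat" in the toy vocabulary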
