SLIDE 1

CS 6956: Deep Learning for NLP

Word Embeddings

SLIDE 2

Overview

  • Representing meaning
  • Word embeddings: Early work
  • Word embeddings via language models
  • Word2vec and GloVe
  • Evaluating embeddings
  • Design choices and open questions


SLIDE 4

Word embeddings via language models

The goal: to find vector embeddings of words

High-level approach:

  • 1. Train a model for a surrogate task (in this case, language modeling)
  • 2. Word embeddings are a byproduct of this process

SLIDE 5

Neural network language models

  • A multi-layer neural network [Bengio et al 2003]
    – Words → embedding layer → hidden layers → softmax
    – Cross-entropy loss
  • Instead of producing a probability, just produce a score for the next word (no softmax) [Collobert and Weston, 2008]
    – Ranking loss (sketched below)
    – Intuition: valid word sequences should get a higher score than invalid ones
  • No need for a multi-layer network; a shallow network is good enough [Mikolov, 2013, word2vec]
    – Simpler model, fewer parameters
    – Faster to train

Context = previous words in the sentence
Context = previous and next words in the sentence
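
To make the ranking-loss idea concrete, here is a minimal sketch (my own illustration, not Collobert and Weston's exact formulation): the model scores a valid window and a corrupted window in which the center word is replaced by a random word, and the loss only vanishes once the valid window wins by a margin.

    import numpy as np

    def margin_ranking_loss(score_valid, score_corrupt, margin=1.0):
        # Hinge loss: zero once the valid window outscores the corrupted one by `margin`.
        return np.maximum(0.0, margin - score_valid + score_corrupt).mean()

    # Toy usage: the scores would come from a (shallow) network applied to word windows.
    print(margin_ranking_loss(np.array([2.3, 0.1]), np.array([0.5, 0.4])))  # 0.65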

SLIDE 6

This lecture

  • The word2vec models: CBOW and Skipgram
  • Connection between word2vec and matrix factorization
  • GloVe

SLIDE 7

Word2Vec

  • Two architectures for learning word embeddings
    – Skipgram and CBOW
  • Both have two key differences from the older Bengio/C&W approaches
    – 1. No hidden layers
    – 2. Extra context (both left and right context)
  • Several computational tricks to make things faster

[Mikolov et al ICLR 2013, Mikolov et al NIPS 2013]

SLIDE 8

This lecture

  • The word2vec models: CBOW and Skipgram
  • Connection between word2vec and matrix factorization
  • GloVe

SLIDE 9

Continuous Bag of Words (CBOW)

Given a window of words of length 2m + 1, call them $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$.

Define a probabilistic model for predicting the middle word:

    $P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$

Train the model by minimizing the loss over the dataset:

    $L = -\sum \log P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$

Need to define this probability to complete the model.

SLIDE 13

The CBOW model

  • The classification task
    – Input: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
    – Output: the center word $x_0$
    – These words correspond to one-hot vectors
      • E.g., cat would be associated with one dimension; its one-hot vector has 1 in that dimension and zero everywhere else
  • Notation:
    – n: the embedding dimension (e.g., 300)
    – V: the vocabulary of words we want to embed
  • Define two matrices:
    1. $\mathcal{V}$: a matrix of size $n \times |V|$
    2. $\mathcal{W}$: a matrix of size $|V| \times n$

SLIDE 14

The CBOW model

1. Map all the context words into the n-dimensional space using $\mathcal{V}$
    – We get 2m vectors $\mathcal{V}x_{-m}, \dots, \mathcal{V}x_{-1}, \mathcal{V}x_1, \dots, \mathcal{V}x_m$

2. Average these vectors to get a context vector:

    $\hat{v} = \frac{1}{2m} \sum_{j=-m,\, j \neq 0}^{m} \mathcal{V}x_j$

3. Use this to compute a score vector for the output: score $= \mathcal{W}\hat{v}$

4. Use the score to compute a probability via softmax: $P(x_0 = \cdot \mid \text{context}) = \mathrm{softmax}(\mathcal{W}\hat{v})$

Recall: Input: context $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$; Output: the center word $x_0$; n: the embedding dimension (e.g., 300); V: the vocabulary; $\mathcal{V}$: a matrix of size $n \times |V|$; $\mathcal{W}$: a matrix of size $|V| \times n$

Word embeddings: the rows of the matrix corresponding to the output, that is, the rows of $\mathcal{W}$.

Exercise: Write this as a computation graph. (A small numpy sketch of these steps follows below.)
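
As a rough sketch of these four steps (an illustration under the assumptions above, not the reference word2vec implementation; the sizes, random matrices, and word indices are made up for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    n, vocab_size, m = 4, 10, 2                   # embedding dim, |V|, context half-width
    V_mat = rng.normal(size=(n, vocab_size))      # context/input embeddings (columns)
    W_mat = rng.normal(size=(vocab_size, n))      # output embeddings (rows)

    def one_hot(i, size=vocab_size):
        v = np.zeros(size); v[i] = 1.0
        return v

    context_ids = [1, 2, 4, 5]                    # the 2m context words (word 3 assumed to be the center)
    v_hat = np.mean([V_mat @ one_hot(i) for i in context_ids], axis=0)   # steps 1-2
    scores = W_mat @ v_hat                                               # step 3
    probs = np.exp(scores - scores.max()); probs /= probs.sum()          # step 4: softmax

Here probs[3] is the model's probability that word 3 is the center word; training adjusts the entries of both matrices to push that probability up.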


SLIDE 21

The CBOW loss: A worked example

Consider the loss for one example with context size 2 on each side. Denote the words by a b c d e, with c being the output.

  • Step 1: Project a, b, d, e using the matrix $\mathcal{V}$. This gives us their vectors $v_a, v_b, v_d, v_e$ (the corresponding columns of $\mathcal{V}$).
  • Step 2: Their average:

    $\hat{v} = \frac{v_a + v_b + v_d + v_e}{4}$

  • Step 3: The score $= \mathcal{W}\hat{v}$
    – Each element of this score corresponds to the score for a single word.
  • Step 4: The probability of a word being the center word:

    $P(\cdot \mid a, b, d, e) = \mathrm{softmax}(\mathcal{W}\hat{v})$


SLIDE 26

The CBOW loss: A worked example

Consider the loss for one example with context size 2 on each side. Denote the words by a b c d e, with c being the output.

  • Step 4: The probability of a word being the center word:

    $P(\cdot \mid a, b, d, e) = \mathrm{softmax}(\mathcal{W}\hat{v})$

More concretely (where $w_j$ is the j-th row of $\mathcal{W}$):

    $P(c \mid a, b, d, e) = \frac{\exp(w_c^T \hat{v})}{\sum_{j=1}^{|V|} \exp(w_j^T \hat{v})}$

The loss requires the negative log of this quantity:

    $Loss = -w_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(w_j^T \hat{v})$

Note that this sum requires us to iterate over the entire vocabulary for each example!

Exercise: Calculate the derivative of this with respect to all the w's and the v's.
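
A quick numeric sanity check of this identity, with made-up toy numbers (the matrix size, the context vector, and the chosen center word are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    W_mat = rng.normal(size=(10, 4))     # |V| x n output matrix (toy numbers)
    v_hat = rng.normal(size=4)           # averaged context vector from the CBOW steps
    center = 3                           # assume word 3 is the true center word

    probs = np.exp(W_mat @ v_hat); probs /= probs.sum()
    loss_direct  = -np.log(probs[center])
    loss_formula = -W_mat[center] @ v_hat + np.log(np.sum(np.exp(W_mat @ v_hat)))
    assert np.isclose(loss_direct, loss_formula)   # the two ways of writing the loss agree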


SLIDE 31

This lecture

  • The word2vec models: CBOW and Skipgram
  • Connection between word2vec and matrix factorization
  • GloVe

SLIDE 32

Skipgram

The other word2vec model.

Given a window of words of length 2m + 1, call them $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$.

Define a probabilistic model for predicting each context word:

    $P(x_{\text{context}} \mid x_0)$

This inverts the inputs and outputs from CBOW. As far as the probabilistic model is concerned: Input: the center word. Output: all the words in the context.


SLIDE 35

The Skipgram model

  • The classification task
    – Input: the center word $x_0$
    – Output: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
    – As before, these words correspond to one-hot vectors
  • Notation:
    – n: the embedding dimension (e.g., 300)
    – V: the vocabulary of words we want to embed
  • Define two matrices:
    1. $\mathcal{V}$: a matrix of size $n \times |V|$
    2. $\mathcal{W}$: a matrix of size $|V| \times n$

SLIDE 36

The Skipgram model

1. Map the center word into the n-dimensional space using $\mathcal{W}$
    – We get an n-dimensional vector $w = \mathcal{W}^T x_0$ (the row of $\mathcal{W}$ for the center word)

2. For the $j$-th context position, compute the score for a word occupying that position: the score vector is $\mathcal{V}^T w$ (its entry for word $k$ is $v_k^T w$)

3. Normalize the score for each position to get a probability:

    $P(x_j = \cdot \mid x_0) = \mathrm{softmax}(\mathcal{V}^T w)$

Recall: Input: the center word $x_0$; Output: context $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$; n: the embedding dimension (e.g., 300); V: the vocabulary; $\mathcal{V}$: a matrix of size $n \times |V|$; $\mathcal{W}$: a matrix of size $|V| \times n$

Remember the goal of learning: make this probability highest for the observed words in this context.

Exercise: Write this as a computation graph. (A small numpy sketch follows below.)
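
A rough numpy sketch of these steps (toy sizes and random matrices; an illustration of the equations above rather than a faithful word2vec implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    n, vocab_size = 4, 10
    V_mat = rng.normal(size=(n, vocab_size))   # context embeddings (columns v_k)
    W_mat = rng.normal(size=(vocab_size, n))   # center-word embeddings (rows)

    center = 3                                 # assume word 3 is the center word
    w = W_mat[center]                          # step 1: W^T x_0 picks out this row
    scores = V_mat.T @ w                       # step 2: one score per vocabulary word
    probs = np.exp(scores - scores.max()); probs /= probs.sum()   # step 3: softmax

    # probs[k] is P(a context position holds word k | center word 3); in this model
    # it is the same distribution for every context position.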


SLIDE 42

The Skipgram loss: A worked example

Consider the loss for one example with context size 2 on each side. Denote the words by a b c d e, with c being the center word.

Step 1: Get the vector $w_c$ (the row of $\mathcal{W}$ for c).

Step 2: For every position, compute the score for a word occupying that position: $\mathcal{V}^T w_c$.

Step 3: Normalize the score for each position using softmax:

    $P(x_j = \cdot \mid x_0 = c) = \mathrm{softmax}(\mathcal{V}^T w_c)$

Or more specifically:

    $P(x_{-2} = a \mid x_0 = c) = \frac{\exp(v_a^T w_c)}{\sum_{j=1}^{|V|} \exp(v_j^T w_c)}$


SLIDE 47

The Skipgram loss: A worked example

Consider the loss for one example with context size 2 on each side. Denote the words by a b c d e, with c being the center word.

Step 3: Normalize the score for each position using softmax:

    $P(x_{-2} = a \mid x_0 = c) = \frac{\exp(v_a^T w_c)}{\sum_{j=1}^{|V|} \exp(v_j^T w_c)}$

The loss for this example is the sum of the negative log of this over all the context words:

    $Loss = \sum_{k \in \{a, b, d, e\}} \left( -v_k^T w_c + \log \sum_{j=1}^{|V|} \exp(v_j^T w_c) \right)$

Note that this sum over $j$ requires us to iterate over the entire vocabulary for each example!

Exercise: Calculate the derivative of this with respect to all the w's and the v's.
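
A sketch of this per-example loss in numpy (the sizes and word indices are made up; note the full sum over the vocabulary inside the log):

    import numpy as np

    rng = np.random.default_rng(0)
    V_mat = rng.normal(size=(4, 10))    # context embeddings, n x |V|
    W_mat = rng.normal(size=(10, 4))    # center-word embeddings, |V| x n

    center = 2                          # "c"
    context = [0, 1, 3, 4]              # "a", "b", "d", "e"

    w_c = W_mat[center]
    log_Z = np.log(np.sum(np.exp(V_mat.T @ w_c)))        # iterates over the whole vocabulary
    loss = sum(-(V_mat[:, k] @ w_c) + log_Z for k in context)
    print(loss)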


SLIDE 51

Negative sampling

  • Can we make it faster?
  • Answer [Mikolov et al 2013]: change the objective function and define a new objective that does not have the same problem
    – Negative sampling
  • The overall method is called Skipgram with Negative Sampling (SGNS)

The term $\log \sum_{j=1}^{|V|} \exp(v_j^T w_c)$ is the problem: this sum requires us to iterate over the entire vocabulary for each example!

SLIDE 52

Negative sampling: The intuition

  • A new task: Given a pair of words (w, c), is this a valid pair or not?
    – That is, can the word c occur in the context window of w or not?
  • This is a binary classification problem
    – We can solve this using logistic regression
    – The probability of a pair of words being valid is defined as

      $P(\text{valid} \mid w, c) = \sigma(v_c^T w_w) = \frac{1}{1 + \exp(-v_c^T w_w)}$

      where $w_w$ is the (center) word vector of $w$ and $v_c$ is the context vector of $c$
  • Positive examples are all pairs that occur in the data; negative examples are all pairs that don't occur in the data, but this is still a massive set!
  • Key insight: Instead of generating all possible negative examples, randomly sample k of them in each epoch of the learning loop
    – That is, there are only k negatives for each positive example, instead of the entire vocabulary

We will visit negative sampling in the first homework.
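
A minimal sketch of the SGNS per-pair loss under these definitions (my own illustration; real implementations sample negatives from a smoothed unigram distribution and use many further tricks):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgns_loss(w_vec, c_pos, c_negs):
        # Negative log-likelihood of one positive (w, c) pair plus k sampled negatives.
        loss = -np.log(sigmoid(w_vec @ c_pos))
        for c_neg in c_negs:
            loss += -np.log(sigmoid(-(w_vec @ c_neg)))   # negatives should be classified as invalid
        return loss

    # Toy usage with random vectors standing in for embedding rows/columns.
    n, k = 4, 5
    w_vec  = rng.normal(size=n)
    c_pos  = rng.normal(size=n)
    c_negs = rng.normal(size=(k, n))                     # k sampled negative context vectors
    print(sgns_loss(w_vec, c_pos, c_negs))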


SLIDE 57

Word2vec notes

There are many other tricks that are needed to make this work and scale:

  – A scaling term in the loss function to ensure that frequent words do not dominate the loss
  – Hierarchical softmax, if you don't want to use negative sampling
  – A clever learning rate schedule
  – Very efficient code

See the reading for more details.

SLIDE 58

This lecture

  • The word2vec models: CBOW and Skipgram
  • Connection between word2vec and matrix factorization
  • GloVe

SLIDE 59

Recall: matrix factorization for embeddings

The general agenda:

1. Construct a word-word matrix M whose entries are some function extracted from data involving words in context (e.g., counts, normalized counts, etc.)
2. Factorize the matrix using SVD to produce lower-dimensional embeddings of the words
3. Use one of the resulting matrices as word embeddings
    – Or some combination thereof
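
For concreteness, a sketch of this agenda with numpy's SVD (the co-occurrence matrix here is random placeholder data; in step 1 it would really come from corpus counts):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim = 10, 4

    M = rng.poisson(2.0, size=(vocab_size, vocab_size)).astype(float)  # stand-in word-word counts
    U, S, Vt = np.linalg.svd(M)                                        # step 2: factorize
    embeddings = U[:, :dim] * S[:dim]                                  # step 3: keep the top dim components
    print(embeddings.shape)   # (10, 4): one 4-dimensional vector per word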

SLIDE 60

Word2vec and matrix factorization

[Levy and Goldberg, NIPS 2014]: Skipgram with negative sampling is implicitly factorizing a specific matrix of this kind.

Two key points to note:

1. The entries in the matrix are a shifted pointwise mutual information (SPPMI) between a word and its context word:

    $PMI(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}$

    $SPPMI(w, c) = PMI(w, c) - \log k$

These probabilities are computed by counting over the data and normalizing.
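
A sketch of how such a shifted PMI matrix could be built from co-occurrence counts (the count matrix is a toy example, and no smoothing is applied; k plays the role of the number of negative samples in SGNS):

    import numpy as np

    def sppmi_matrix(counts, k=5):
        # Shifted PMI from a word-by-context count matrix (illustrative only).
        total = counts.sum()
        p_wc = counts / total
        p_w = counts.sum(axis=1, keepdims=True) / total
        p_c = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore"):                 # unseen pairs give -inf PMI
            pmi = np.log(p_wc / (p_w * p_c))
        return pmi - np.log(k)

    counts = np.array([[8., 2., 0.],
                       [2., 5., 1.],
                       [0., 1., 4.]])
    print(sppmi_matrix(counts, k=5))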


SLIDE 64

Word2vec and matrix factorization

[Levy and Goldberg, NIPS 2014]: Skipgram with negative sampling is implicitly factorizing a specific matrix of this kind.

Two key points to note:

2. The matrix factorization method is not truncated SVD.
    – It instead minimizes the objective function to compute the factorized matrices

SLIDE 65

This lecture

  • The word2vec models: CBOW and Skipgram
  • Connection between word2vec and matrix factorization
  • GloVe [Pennington et al 2014]

SLIDE 66

What matrix to factorize?

If we are building word embeddings by factorizing a matrix, what matrix should we consider?

  • Word counts [Rhode et al 2005]
  • Shifted PPMI (implicitly) [Mikolov 2013, Levy & Goldberg 2014]
  • Another answer: log co-occurrence counts [Pennington et al 2014]

SLIDE 67

Co-occurrence probabilities

Given two words i and j that occur in text, their co-occurrence probability is defined as the probability of seeing i in the context of j:

    $P(i \mid j) = \frac{\text{count}(i \text{ in context of } j)}{\sum_k \text{count}(k \text{ in context of } j)}$

Claim: If we want to distinguish between two words, it is not enough to look at their co-occurrences; we need to look at the ratio of their co-occurrences with other words.
    – Formalizing this intuition gives us an optimization problem
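
A sketch of how these co-occurrence counts and probabilities could be gathered from a tokenized corpus (the window size and the toy corpus below are made up for illustration):

    from collections import Counter, defaultdict

    def cooccurrence_probs(tokens, window=2):
        # P(i | j): how often word i appears within `window` positions of word j.
        counts = defaultdict(Counter)
        for pos, center in enumerate(tokens):
            for off in range(-window, window + 1):
                if off != 0 and 0 <= pos + off < len(tokens):
                    counts[center][tokens[pos + off]] += 1
        return {j: {i: c / sum(ctx.values()) for i, c in ctx.items()}
                for j, ctx in counts.items()}

    toy = "ice is solid steam is gas ice is cold".split()
    print(cooccurrence_probs(toy)["is"])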


SLIDE 69

The GloVe objective

Notation:

  • $i$: a word, $j$: a context word
  • $w_i$: the word embedding for $i$
  • $c_j$: the context embedding for $j$
  • $b_i$, $\tilde{b}_j$: two bias terms, word-specific and context-specific
  • $X_{ij}$: the number of times word $i$ occurs in the context of $j$

The intuition:

1. Construct a word-context matrix whose $(i, j)$ entry is $\log X_{ij}$
2. Find vectors $w_i$, $c_j$ and the biases $b_i$, $\tilde{b}_j$ such that the dot product of the vectors added to the biases approximates the matrix entries

SLIDE 70

The GloVe objective

(Notation as on the previous slide.)

Objective:

    $J = \sum_{i,j=1}^{|V|} \left( w_i^T c_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

Problem: Pairs that frequently co-occur tend to dominate the objective.

Answer: Correct for this by adding an extra term that prevents this.


SLIDE 73

The GloVe objective

(Notation as before.)

Objective:

    $J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T c_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

$f$: a weighting function that assigns lower relative importance to frequent co-occurrences.
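
A sketch of this weighted objective in numpy (toy random parameters; the particular form of f used here, $\min((x/x_{max})^{0.75}, 1)$, is an assumption borrowed from the GloVe paper rather than something stated on the slide):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim = 6, 3

    X = rng.poisson(3.0, size=(vocab_size, vocab_size)) + 1   # toy co-occurrence counts (kept positive)
    w = rng.normal(size=(vocab_size, dim))        # word vectors w_i
    c = rng.normal(size=(vocab_size, dim))        # context vectors c_j
    b = rng.normal(size=vocab_size)               # word biases b_i
    b_tilde = rng.normal(size=vocab_size)         # context biases b~_j

    def f(x, x_max=100.0, alpha=0.75):
        # weighting that damps frequent co-occurrences (assumed form)
        return np.minimum((x / x_max) ** alpha, 1.0)

    residual = w @ c.T + b[:, None] + b_tilde[None, :] - np.log(X)
    J = np.sum(f(X) * residual ** 2)
    print(J)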

SLIDE 74

GloVe: Global Vectors

Essentially a matrix factorization method, though it does not compute a standard SVD:

1. Re-weighting for frequency
2. Two-way factorization, unlike SVD, which produces three matrices $U, \Sigma, V$
3. Bias terms

Final word embeddings for a word: the average of the word vector and the context vector of that word.

SLIDE 75

Summary

  • We saw three different methods for word embeddings
  • Many, many, many variants and improvements exist
  • Various tunable parameters/training choices:
    – Dimensionality of the embeddings
    – Text for training the embeddings
    – The context window size, and whether it should be symmetric
    – And the usual stuff: the learning algorithm, the loss function, hyper-parameters
  • See references for more details