

SLIDE 1

Lecture 7: Word Embeddings

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16


SLIDE 2

This lecture

- Learning word vectors (cont.)
- Representation learning in NLP


SLIDE 3

Recap: Latent Semantic Analysis

- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors
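The whole recipe fits in a few lines of NumPy. The sketch below uses a toy co-occurrence matrix of my own (not from the lecture): factorize it with SVD and compare words by the cosine of their latent vectors.

    # Minimal LSA sketch: co-occurrence matrix -> SVD -> cosine similarity.
    import numpy as np

    words = ["joy", "gladden", "sorrow", "sadden"]
    C = np.array([[3., 2., 0., 0.],   # toy co-occurrence counts (rows = words)
                  [2., 3., 0., 1.],
                  [0., 0., 3., 2.],
                  [0., 1., 2., 3.]])

    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2                              # keep the top-k latent components
    W = U[:, :k] * S[:k]               # latent word vectors (one per row)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(W[0], W[1]))  # joy vs. gladden: high
    print(cosine(W[0], W[2]))  # joy vs. sorrow: low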

SLIDE 4

Recap: Mapping to Latent Space via SVD

- SVD generalizes the original data
  - Uncovers relationships not explicit in the thesaurus
  - Term vectors projected to a k-dimensional latent space
- Word similarity: cosine of two column vectors in ΣV^T

(Figure: C ≈ U Σ V^T, with C of size m×n, U of size m×k, Σ of size k×k, and V^T of size k×n.)

SLIDE 5

Low rank approximation

- Frobenius norm: for an m×n matrix C,

  ||C||_F = sqrt( Σ_{i=1}^m Σ_{j=1}^n |c_ij|^2 )

- Rank of a matrix:
  - How many vectors in the matrix are independent of each other
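A quick NumPy illustration of both definitions, on a toy matrix of my own:

    # Frobenius norm and rank of a small matrix.
    import numpy as np

    C = np.array([[1., 2., 3.],
                  [2., 4., 6.],   # = 2 x row 1, so the rows are dependent
                  [0., 1., 1.]])

    fro = np.sqrt((np.abs(C) ** 2).sum())   # the definition above
    print(fro, np.linalg.norm(C, 'fro'))    # both print the same value
    print(np.linalg.matrix_rank(C))         # 2: only two independent rows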

SLIDE 6

Low rank approximation

- Low-rank approximation problem:

  min_X ||C − X||_F   s.t. rank(X) = k

- If I can only use k independent vectors to describe the points in the space, what are the best choices?

Essentially, we minimize the "reconstruction loss" under a low-rank constraint.


SLIDE 8

Low rank approximation

- Assume the rank of C is r
- SVD: C = U Σ V^T, with Σ = diag(σ_1, σ_2, …, σ_r, 0, 0, …, 0)   (r non-zero singular values)
- Zero out the r − k trailing values: Σ' = diag(σ_1, σ_2, …, σ_k, 0, 0, …, 0)
- C' = U Σ' V^T is the best rank-k approximation:

  C' = arg min_X ||C − X||_F   s.t. rank(X) = k
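A small NumPy check of this construction, on toy data of my own: truncate the SVD, then verify that the Frobenius error equals the root of the summed squared trailing singular values and that random rank-k competitors never do better (the Eckart–Young result the slide appeals to).

    # Best rank-k approximation via truncated SVD.
    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.normal(size=(6, 5))
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    k = 2
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                        # Sigma' = diag(s_1..s_k, 0, ..., 0)
    C_k = U @ np.diag(s_trunc) @ Vt          # C' = U Sigma' V^T

    err = np.linalg.norm(C - C_k, 'fro')
    print(err, np.sqrt((s[k:] ** 2).sum()))  # the two values agree

    for _ in range(100):                     # random rank-k competitors
        X = rng.normal(size=(6, k)) @ rng.normal(size=(k, 5))
        assert np.linalg.norm(C - X, 'fro') >= err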

SLIDE 9

Word2Vec

- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
- Easy to incorporate new words or sentences
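As a usage sketch (not from the lecture): skip-gram vectors can be trained with the gensim library. This assumes gensim ≥ 4; the two-sentence corpus and all parameter values are illustrative only.

    # Hedged sketch: training skip-gram embeddings with gensim.
    from gensim.models import Word2Vec

    sentences = [["the", "weather", "is", "sunny"],
                 ["the", "weather", "is", "rainy"]]

    model = Word2Vec(sentences,
                     vector_size=50,  # embedding dimension
                     window=5,        # context window m
                     sg=1,            # 1 = skip-gram, 0 = CBOW
                     negative=5,      # negative sampling (see Slide 24)
                     min_count=1)     # keep every word in this tiny corpus
    print(model.wv.most_similar("sunny", topn=2))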

SLIDE 10

Word2Vec

- Similar to a language model, but predicting the next word is not the goal
- Idea: words that are semantically similar often occur near each other in text
- Embeddings that are good at predicting neighboring words are also good at representing similarity

SLIDE 11

Skip-gram vs. Continuous Bag-of-Words

- What are the differences?

SLIDE 12

Skip-gram vs. Continuous Bag-of-Words

SLIDE 13

Objective of Word2Vec (Skip-gram)

- Maximize the log-likelihood of the context words w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m} given the word w_t
- m is usually 5~10

SLIDE 14

Objective of Word2Vec (Skip-gram)

- How to model log p(w_{t+j} | w_t)?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w'} exp(u_{w'} · v_{w_t})

  The softmax function, again!

- Every word has two vectors:
  - v_w: when w is the center word
  - u_w: when w is the outside word (context word)
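A minimal sketch of this softmax parameterization, with toy dimensions and random vectors of my own (v holds center-word vectors and u outside-word vectors, matching the slide; the function name is mine):

    # p(o | c) = exp(u_o . v_c) / sum_{w'} exp(u_{w'} . v_c)
    import numpy as np

    V, d = 10, 4                        # vocabulary size, embedding dimension
    rng = np.random.default_rng(0)
    v = rng.normal(size=(V, d))         # v[w]: w as the center word
    u = rng.normal(size=(V, d))         # u[w]: w as the outside word

    def p_outside_given_center(o, c):
        scores = u @ v[c]               # u_{w'} . v_c for every w'
        scores -= scores.max()          # for numerical stability
        e = np.exp(scores)
        return e[o] / e.sum()

    print(p_outside_given_center(o=3, c=7))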

SLIDE 15

How to update?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w'} exp(u_{w'} · v_{w_t})

- How to minimize the objective J(θ)?
  - Gradient descent!
  - How to compute the gradient?

SLIDE 16

Recap: Calculus

- Gradient: for x^T = (x_1, x_2, x_3),

  ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, ∂f(x)/∂x_3 )^T

- If f(x) = a · x (also written a^T x), then

  ∇f(x) = a

SLIDE 17

Recap: Calculus

- Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then

  dy/dx = (dy/du)(du/dx) = f'(u) g'(x)

- Exercises: compute dy/dx for
  1. y = x^3 + 6
  2. y = ln(x^2 + 2)
  3. y = exp(x^2 + 3x + 2)
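As a worked instance of the chain rule (exercise 2 above): with f(u) = ln(u) and u = g(x) = x^2 + 2,

  dy/dx = (dy/du)(du/dx) = (1 / (x^2 + 2)) · 2x = 2x / (x^2 + 2)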
SLIDE 18

Other useful formulas

- y = exp(x)  ⟹  dy/dx = exp(x)
- y = log(x)  ⟹  dy/dx = 1/x

When I say log (in this course), I usually mean ln.

SLIDE 19


SLIDE 20

Example

- Assume the vocabulary set is W. We have one center word c and one context word o.
- What is the conditional probability p(o | c)?

  p(o | c) = exp(u_o · v_c) / Σ_{w'∈W} exp(u_{w'} · v_c)

- What is the gradient of the log-likelihood w.r.t. v_c?

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼p(w|c)}[u_w]
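A numerical check of this gradient identity, on toy data of my own: compare the analytic form u_o − E_{w∼p(w|c)}[u_w] against finite differences of log p(o | c).

    import numpy as np

    V, d = 8, 3
    rng = np.random.default_rng(1)
    u = rng.normal(size=(V, d))      # outside-word vectors
    v_c = rng.normal(size=d)         # center-word vector
    o = 2                            # the observed context word

    def log_p(v_c):
        s = u @ v_c
        return s[o] - np.log(np.exp(s).sum())

    p = np.exp(u @ v_c)
    p /= p.sum()                     # p(w | c) for every w
    analytic = u[o] - p @ u          # u_o - E_{w ~ p(w|c)}[u_w]

    eps = 1e-6
    numeric = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                        for e in np.eye(d)])
    print(np.allclose(analytic, numeric, atol=1e-6))   # True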

SLIDE 21

Gradient Descent

- Goal: min_w J(w)
- Update: w ← w − η ∇J(w)
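A minimal sketch of this update rule on a toy objective of my own, J(w) = (w − 3)^2:

    def grad_J(w):
        return 2 * (w - 3.0)      # gradient of J(w) = (w - 3)^2

    w, eta = 0.0, 0.1             # initial point, learning rate
    for _ in range(100):
        w = w - eta * grad_J(w)   # w <- w - eta * grad J(w)
    print(w)                      # close to the minimizer 3.0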

SLIDE 22

Local minimum vs. global minimum


SLIDE 23

Stochastic gradient descent

- Let J(w) = (1/n) Σ_{i=1}^n J_i(w)
- Gradient descent update rule:

  w ← w − (η/n) Σ_{i=1}^n ∇J_i(w)

- Stochastic gradient descent:
  - Approximate (1/n) Σ_{i=1}^n ∇J_i(w) by the gradient at a single example, ∇J_j(w) (why?)
  - At each step:
    - Randomly pick an example j
    - w ← w − η ∇J_j(w)
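A minimal SGD sketch matching the recipe above, with a toy objective of my own, J_i(w) = (w − a_i)^2, whose overall minimizer is the mean of the a_i:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(loc=3.0, size=1000)    # one a_i per example
    w, eta = 0.0, 0.01

    for step in range(5000):
        j = rng.integers(len(a))          # randomly pick an example j
        w = w - eta * 2 * (w - a[j])      # w <- w - eta * grad J_j(w)
    print(w, a.mean())                    # w ends up near mean(a)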

SLIDE 24

Negative sampling

- With a large vocabulary set, stochastic gradient descent is still not enough (why?)

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼p(w|c)}[u_w]

  The expectation (like the softmax normalizer) still requires a sum over the entire vocabulary at every step.

- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples
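A sketch of the resulting objective in the form popularized by Mikolov et al. (2013), with toy vectors and vocabulary of my own: each update scores the true (o, c) pair against only K sampled negatives instead of summing over the whole vocabulary.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    V, d, K = 1000, 50, 5                       # vocab size, dim, negatives
    rng = np.random.default_rng(0)
    u = rng.normal(scale=0.1, size=(V, d))      # outside-word vectors
    v = rng.normal(scale=0.1, size=(V, d))      # center-word vectors

    def neg_sampling_loss(o, c):
        negs = rng.integers(V, size=K)          # K sampled negative words
        loss = -np.log(sigmoid(u[o] @ v[c]))    # pull the true pair together
        loss -= np.log(sigmoid(-(u[negs] @ v[c]))).sum()  # push negatives away
        return loss

    print(neg_sampling_loss(o=3, c=7))          # cost per step: O(K d), not O(V d)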

SLIDE 25

More about Word2Vec: relation to LSA

- LSA factorizes a matrix of co-occurrence counts
- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
- PMI(w, c) = log [ P(w|c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) · |D| / (#(w) · #(c)) ]

  where |D| is the total number of (word, context) pairs.
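A small sketch computing PMI from pair counts with the last formula, on toy counts of my own:

    import numpy as np
    from collections import Counter

    pairs = [("sunny", "weather"), ("rainy", "weather"),
             ("sunny", "weather"), ("car", "wheel")]
    pair_count = Counter(pairs)                 # #(w, c)
    w_count = Counter(w for w, _ in pairs)      # #(w)
    c_count = Counter(c for _, c in pairs)      # #(c)
    D = len(pairs)                              # |D|: total number of pairs

    def pmi(w, c):
        return np.log(pair_count[(w, c)] * D / (w_count[w] * c_count[c]))

    print(pmi("sunny", "weather"))              # log(2 * 4 / (2 * 3)) ~ 0.29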

SLIDE 26

All problems solved?


SLIDE 27

Continuous Semantic Representations

(Figure: word clusters such as sunny, rainy, windy, cloudy; car, wheel, cab; sad, joy, emotion, feeling.)

SLIDE 28

Semantics Needs More Than Similarity

- Tomorrow will be rainy.
- Tomorrow will be sunny.
- similar(rainy, sunny)?
- antonym(rainy, sunny)?

SLIDE 29

Polarity Inducing LSA [Yih, Zweig, Platt 2012]

- Data representation
  - Encode two opposite relations in a matrix using "polarity"
    - Synonyms & antonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

SLIDE 30

Encode Synonyms & Antonyms in Matrix

- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden
- Inducing polarity

                           joy   gladden   sorrow   sadden   goodwill
    Group 1: "joyfulness"   1       1        -1       -1        0
    Group 2: "sad"         -1      -1         1        1        0
    Group 3: "affection"    0       0         0        0        1

Target word: column vector. Cosine score: positive for synonyms.

SLIDE 31

Encode Synonyms & Antonyms in Matrix

- Same matrix as in Slide 30
- Target word: column vector. Cosine score: negative for antonyms
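A sketch of the polarity trick, with the matrix values copied from the slide: synonyms get +1 and antonyms −1, so the cosine between two word (column) vectors comes out positive for synonyms and negative for antonyms.

    import numpy as np

    words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
    M = np.array([[ 1.,  1., -1., -1., 0.],   # group 1: "joyfulness"
                  [-1., -1.,  1.,  1., 0.],   # group 2: "sad"
                  [ 0.,  0.,  0.,  0., 1.]])  # group 3: "affection"

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    col = {w: M[:, i] for i, w in enumerate(words)}
    print(cosine(col["joy"], col["gladden"]))   # +1.0: synonyms
    print(cosine(col["joy"], col["sorrow"]))    # -1.0: antonyms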

SLIDE 32

Continuous representations for entities

(Figure: an embedding space of entities: Michelle Obama, Democratic Party, George W. Bush, Laura Bush, Republican Party, with one relation marked "?".)

SLIDE 33

Continuous representations for entities

- Useful resources for NLP applications:
  - Semantic Parsing & Question Answering
  - Information Extraction