

SLIDE 1

Lecture 7: Word Embeddings

Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16


SLIDE 2

This lecture

- Learning word vectors (cont.)
- Representation learning in NLP


SLIDE 3

Recap: Latent Semantic Analysis

- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors
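The whole recipe fits in a few lines of NumPy. The sketch below uses a toy co-occurrence matrix of my own (not from the lecture): factorize it with SVD and compare words by the cosine of their latent vectors.

    # Minimal LSA sketch: co-occurrence matrix -> SVD -> cosine similarity.
    import numpy as np

    words = ["joy", "gladden", "sorrow", "sadden"]
    C = np.array([[3., 2., 0., 0.],   # toy co-occurrence counts (rows = words)
                  [2., 3., 0., 1.],
                  [0., 0., 3., 2.],
                  [0., 1., 2., 3.]])

    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2                              # keep the top-k latent components
    W = U[:, :k] * S[:k]               # latent word vectors (one per row)

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(W[0], W[1]))  # joy vs. gladden: high
    print(cosine(W[0], W[2]))  # joy vs. sorrow: low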

SLIDE 4

Recap: Mapping to Latent Space via SVD

- SVD generalizes the original data
  - Uncovers relationships not explicit in the thesaurus
  - Term vectors projected to a k-dimensional latent space
- Word similarity: cosine of two column vectors in ΣV^T

(Figure: C ≈ U Σ V^T, with C of size m×n, U of size m×k, Σ of size k×k, and V^T of size k×n.)

SLIDE 5

Low rank approximation

- Frobenius norm: for an m×n matrix C,

  ||C||_F = sqrt( Σ_{i=1}^m Σ_{j=1}^n |c_ij|^2 )

- Rank of a matrix:
  - How many vectors in the matrix are independent of each other
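A quick NumPy illustration of both definitions, on a toy matrix of my own:

    # Frobenius norm and rank of a small matrix.
    import numpy as np

    C = np.array([[1., 2., 3.],
                  [2., 4., 6.],   # = 2 x row 1, so the rows are dependent
                  [0., 1., 1.]])

    fro = np.sqrt((np.abs(C) ** 2).sum())   # the definition above
    print(fro, np.linalg.norm(C, 'fro'))    # both print the same value
    print(np.linalg.matrix_rank(C))         # 2: only two independent rows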

SLIDE 6

Low rank approximation

- Low-rank approximation problem:

  min_X ||C − X||_F   s.t. rank(X) = k

- If I can only use k independent vectors to describe the points in the space, what are the best choices?

Essentially, we minimize the "reconstruction loss" under a low-rank constraint.


SLIDE 8

Low rank approximation

- Assume the rank of C is r
- SVD: C = U Σ V^T, with Σ = diag(σ_1, σ_2, …, σ_r, 0, 0, …, 0)   (r non-zero singular values)
- Zero out the r − k trailing values: Σ' = diag(σ_1, σ_2, …, σ_k, 0, 0, …, 0)
- C' = U Σ' V^T is the best rank-k approximation:

  C' = arg min_X ||C − X||_F   s.t. rank(X) = k
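A small NumPy check of this construction, on toy data of my own: truncate the SVD, then verify that the Frobenius error equals the root of the summed squared trailing singular values and that random rank-k competitors never do better (the Eckart–Young result the slide appeals to).

    # Best rank-k approximation via truncated SVD.
    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.normal(size=(6, 5))
    U, s, Vt = np.linalg.svd(C, full_matrices=False)

    k = 2
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                        # Sigma' = diag(s_1..s_k, 0, ..., 0)
    C_k = U @ np.diag(s_trunc) @ Vt          # C' = U Sigma' V^T

    err = np.linalg.norm(C - C_k, 'fro')
    print(err, np.sqrt((s[k:] ** 2).sum()))  # the two values agree

    for _ in range(100):                     # random rank-k competitors
        X = rng.normal(size=(6, k)) @ rng.normal(size=(k, 5))
        assert np.linalg.norm(C - X, 'fro') >= err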

SLIDE 9

Word2Vec

- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)
- Easy to incorporate new words or sentences
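As a usage sketch (not from the lecture): skip-gram vectors can be trained with the gensim library. This assumes gensim ≥ 4; the two-sentence corpus and all parameter values are illustrative only.

    # Hedged sketch: training skip-gram embeddings with gensim.
    from gensim.models import Word2Vec

    sentences = [["the", "weather", "is", "sunny"],
                 ["the", "weather", "is", "rainy"]]

    model = Word2Vec(sentences,
                     vector_size=50,  # embedding dimension
                     window=5,        # context window m
                     sg=1,            # 1 = skip-gram, 0 = CBOW
                     negative=5,      # negative sampling (see Slide 24)
                     min_count=1)     # keep every word in this tiny corpus
    print(model.wv.most_similar("sunny", topn=2))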

SLIDE 10

Word2Vec

- Similar to a language model, but predicting the next word is not the goal
- Idea: words that are semantically similar often occur near each other in text
- Embeddings that are good at predicting neighboring words are also good at representing similarity

SLIDE 11

Skip-gram vs. Continuous Bag-of-Words

- What are the differences?

SLIDE 12

Skip-gram vs. Continuous Bag-of-Words

SLIDE 13

Objective of Word2Vec (Skip-gram)

- Maximize the log-likelihood of the context words w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m} given the word w_t
- m is usually 5~10

SLIDE 14

Objective of Word2Vec (Skip-gram)

- How to model log p(w_{t+j} | w_t)?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w'} exp(u_{w'} · v_{w_t})

  The softmax function, again!

- Every word has two vectors:
  - v_w: when w is the center word
  - u_w: when w is the outside word (context word)
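A minimal sketch of this softmax parameterization, with toy dimensions and random vectors of my own (v holds center-word vectors and u outside-word vectors, matching the slide; the function name is mine):

    # p(o | c) = exp(u_o . v_c) / sum_{w'} exp(u_{w'} . v_c)
    import numpy as np

    V, d = 10, 4                        # vocabulary size, embedding dimension
    rng = np.random.default_rng(0)
    v = rng.normal(size=(V, d))         # v[w]: w as the center word
    u = rng.normal(size=(V, d))         # u[w]: w as the outside word

    def p_outside_given_center(o, c):
        scores = u @ v[c]               # u_{w'} . v_c for every w'
        scores -= scores.max()          # for numerical stability
        e = np.exp(scores)
        return e[o] / e.sum()

    print(p_outside_given_center(o=3, c=7))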

SLIDE 15

How to update?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w'} exp(u_{w'} · v_{w_t})

- How to minimize the objective J(θ)?
  - Gradient descent!
  - How to compute the gradient?

SLIDE 16

Recap: Calculus

- Gradient: for x^T = (x_1, x_2, x_3),

  ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, ∂f(x)/∂x_3 )^T

- If f(x) = a · x (also written a^T x), then

  ∇f(x) = a

SLIDE 17

Recap: Calculus

- Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then

  dy/dx = (dy/du)(du/dx) = f'(u) g'(x)

- Exercises: compute dy/dx for
  1. y = x^3 + 6
  2. y = ln(x^2 + 2)
  3. y = exp(x^2 + 3x + 2)
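As a worked instance of the chain rule (exercise 2 above): with f(u) = ln(u) and u = g(x) = x^2 + 2,

  dy/dx = (dy/du)(du/dx) = (1 / (x^2 + 2)) · 2x = 2x / (x^2 + 2)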
SLIDE 18

Other useful formulas

- y = exp(x)  ⟹  dy/dx = exp(x)
- y = log(x)  ⟹  dy/dx = 1/x

When I say log (in this course), I usually mean ln.

SLIDE 19


SLIDE 20

Example

- Assume the vocabulary set is W. We have one center word c and one context word o.
- What is the conditional probability p(o | c)?

  p(o | c) = exp(u_o · v_c) / Σ_{w'∈W} exp(u_{w'} · v_c)

- What is the gradient of the log-likelihood w.r.t. v_c?

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼p(w|c)}[u_w]
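A numerical check of this gradient identity, on toy data of my own: compare the analytic form u_o − E_{w∼p(w|c)}[u_w] against finite differences of log p(o | c).

    import numpy as np

    V, d = 8, 3
    rng = np.random.default_rng(1)
    u = rng.normal(size=(V, d))      # outside-word vectors
    v_c = rng.normal(size=d)         # center-word vector
    o = 2                            # the observed context word

    def log_p(v_c):
        s = u @ v_c
        return s[o] - np.log(np.exp(s).sum())

    p = np.exp(u @ v_c)
    p /= p.sum()                     # p(w | c) for every w
    analytic = u[o] - p @ u          # u_o - E_{w ~ p(w|c)}[u_w]

    eps = 1e-6
    numeric = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                        for e in np.eye(d)])
    print(np.allclose(analytic, numeric, atol=1e-6))   # True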

SLIDE 21

Gradient Descent

- Goal: min_w J(w)
- Update: w ← w − η ∇J(w)
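A minimal sketch of this update rule on a toy objective of my own, J(w) = (w − 3)^2:

    def grad_J(w):
        return 2 * (w - 3.0)      # gradient of J(w) = (w - 3)^2

    w, eta = 0.0, 0.1             # initial point, learning rate
    for _ in range(100):
        w = w - eta * grad_J(w)   # w <- w - eta * grad J(w)
    print(w)                      # close to the minimizer 3.0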

SLIDE 22

Local minimum vs. global minimum


SLIDE 23

Stochastic gradient descent

- Let J(w) = (1/n) Σ_{i=1}^n J_i(w)
- Gradient descent update rule:

  w ← w − (η/n) Σ_{i=1}^n ∇J_i(w)

- Stochastic gradient descent:
  - Approximate (1/n) Σ_{i=1}^n ∇J_i(w) by the gradient at a single example, ∇J_j(w) (why?)
  - At each step:
    - Randomly pick an example j
    - w ← w − η ∇J_j(w)
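A minimal SGD sketch matching the recipe above, with a toy objective of my own, J_i(w) = (w − a_i)^2, whose overall minimizer is the mean of the a_i:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(loc=3.0, size=1000)    # one a_i per example
    w, eta = 0.0, 0.01

    for step in range(5000):
        j = rng.integers(len(a))          # randomly pick an example j
        w = w - eta * 2 * (w - a[j])      # w <- w - eta * grad J_j(w)
    print(w, a.mean())                    # w ends up near mean(a)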

SLIDE 24

Negative sampling

- With a large vocabulary set, stochastic gradient descent is still not enough (why?)

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼p(w|c)}[u_w]

  The expectation (like the softmax normalizer) still requires a sum over the entire vocabulary at every step.

- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples
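A sketch of the resulting objective in the form popularized by Mikolov et al. (2013), with toy vectors and vocabulary of my own: each update scores the true (o, c) pair against only K sampled negatives instead of summing over the whole vocabulary.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    V, d, K = 1000, 50, 5                       # vocab size, dim, negatives
    rng = np.random.default_rng(0)
    u = rng.normal(scale=0.1, size=(V, d))      # outside-word vectors
    v = rng.normal(scale=0.1, size=(V, d))      # center-word vectors

    def neg_sampling_loss(o, c):
        negs = rng.integers(V, size=K)          # K sampled negative words
        loss = -np.log(sigmoid(u[o] @ v[c]))    # pull the true pair together
        loss -= np.log(sigmoid(-(u[negs] @ v[c]))).sum()  # push negatives away
        return loss

    print(neg_sampling_loss(o=3, c=7))          # cost per step: O(K d), not O(V d)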

SLIDE 25

More about Word2Vec: relation to LSA

- LSA factorizes a matrix of co-occurrence counts
- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
- PMI(w, c) = log [ P(w|c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) · |D| / (#(w) · #(c)) ]

  where |D| is the total number of (word, context) pairs.
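A small sketch computing PMI from pair counts with the last formula, on toy counts of my own:

    import numpy as np
    from collections import Counter

    pairs = [("sunny", "weather"), ("rainy", "weather"),
             ("sunny", "weather"), ("car", "wheel")]
    pair_count = Counter(pairs)                 # #(w, c)
    w_count = Counter(w for w, _ in pairs)      # #(w)
    c_count = Counter(c for _, c in pairs)      # #(c)
    D = len(pairs)                              # |D|: total number of pairs

    def pmi(w, c):
        return np.log(pair_count[(w, c)] * D / (w_count[w] * c_count[c]))

    print(pmi("sunny", "weather"))              # log(2 * 4 / (2 * 3)) ~ 0.29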

SLIDE 26

All problems solved?


SLIDE 27

Continuous Semantic Representations

(Figure: word clusters such as sunny, rainy, windy, cloudy; car, wheel, cab; sad, joy, emotion, feeling.)

SLIDE 28

Semantics Needs More Than Similarity

- Tomorrow will be rainy.
- Tomorrow will be sunny.
- similar(rainy, sunny)?
- antonym(rainy, sunny)?

SLIDE 29

Polarity Inducing LSA [Yih, Zweig, Platt 2012]

- Data representation
  - Encode two opposite relations in a matrix using "polarity"
    - Synonyms & antonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

SLIDE 30

Encode Synonyms & Antonyms in Matrix

- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden
- Inducing polarity

                           joy   gladden   sorrow   sadden   goodwill
    Group 1: "joyfulness"   1       1        -1       -1        0
    Group 2: "sad"         -1      -1         1        1        0
    Group 3: "affection"    0       0         0        0        1

Target word: column vector. Cosine score: positive for synonyms.

SLIDE 31

Encode Synonyms & Antonyms in Matrix

- Same matrix as in Slide 30
- Target word: column vector. Cosine score: negative for antonyms
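A sketch of the polarity trick, with the matrix values copied from the slide: synonyms get +1 and antonyms −1, so the cosine between two word (column) vectors comes out positive for synonyms and negative for antonyms.

    import numpy as np

    words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
    M = np.array([[ 1.,  1., -1., -1., 0.],   # group 1: "joyfulness"
                  [-1., -1.,  1.,  1., 0.],   # group 2: "sad"
                  [ 0.,  0.,  0.,  0., 1.]])  # group 3: "affection"

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    col = {w: M[:, i] for i, w in enumerate(words)}
    print(cosine(col["joy"], col["gladden"]))   # +1.0: synonyms
    print(cosine(col["joy"], col["sorrow"]))    # -1.0: antonyms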

SLIDE 32

Continuous representations for entities

(Figure: an embedding space of entities: Michelle Obama, Democratic Party, George W. Bush, Laura Bush, Republican Party, with one relation marked "?".)

SLIDE 33

Continuous representations for entities

- Useful resources for NLP applications:
  - Semantic Parsing & Question Answering
  - Information Extraction