SLIDE 1

Improving Distributional Similarity with Lessons Learned from Word Embeddings

Authors: Omer Levy, Yoav Goldberg, Ido Dagan
Presentation: Collin Gress

SLIDES 2-5

Motivation

• We want to do NLP tasks. How do we represent words?
• We generally want vectors (think neural networks).
• What are some ways to get vector representations of words?
• Distributional hypothesis: "words that are used and occur in the same contexts tend to purport similar meanings" (Wikipedia).

SLIDES 6-9

Vector representations of words and their surrounding contexts

• Word2vec [1]
• GloVe [2]
• PMI (pointwise mutual information)
• SVD of PMI (singular value decomposition of the PMI matrix)

SLIDES 10-13

Very briefly: Word2vec

• This paper focuses on skip-gram with negative sampling (SGNS), which predicts context words from a target word.
• Optimization problem solvable by gradient descent: maximize w · c for word-context pairs that occur in the dataset, and minimize it for "hallucinated" (negative) word-context pairs [0].
• For every real word-context pair in the dataset, hallucinate k word-context pairs. That is, given a target word, draw k contexts from P(c) = count(c) / Σ_c' count(c').
• End up with a vector w ∈ ℝ^d for every word in the dataset and, similarly, a vector c ∈ ℝ^d for each context in the dataset. See the Mikolov paper [1] for details; a minimal code sketch follows below.
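A minimal sketch of the per-pair SGNS objective under these definitions. This is not the authors' implementation; the function names, the use of numpy, and the sigmoid parameterization (which follows [1]) are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_negatives(context_counts, k, rng):
        # Draw k "hallucinated" contexts from P(c) = count(c) / sum_c' count(c').
        p = context_counts / context_counts.sum()
        return rng.choice(len(context_counts), size=k, p=p)

    def sgns_pair_loss(w, c_pos, c_negs):
        # w, c_pos: (d,) vectors; c_negs: (k, d) matrix of negative context vectors.
        # Push w.c up for the observed pair and down for the k negatives;
        # equivalently, minimize this negative log-likelihood by SGD.
        pos = np.log(sigmoid(w @ c_pos))
        neg = np.log(sigmoid(-(c_negs @ w))).sum()
        return -(pos + neg)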

SLIDES 14-15

Very briefly: GloVe

• Learn d-dimensional vectors w and c, as well as word- and context-specific scalars b_w and b_c, such that w · c + b_w + b_c = log(count(w, c)) for all word-context pairs in the dataset [0].
• The objective is "solved" by factorizing the log-count matrix: log(count(w, c)) ≈ w · c + b_w + b_c (a code sketch follows below).
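A sketch of the corresponding per-pair GloVe term. The weighting function with x_max = 100 and alpha = 0.75 comes from the GloVe paper [2] rather than from this slide, and the function name is an illustrative assumption.

    import numpy as np

    def glove_pair_loss(w, c, b_w, b_c, count_wc, x_max=100.0, alpha=0.75):
        # Fit w.c + b_w + b_c ~ log(count(w, c)); GloVe down-weights the
        # squared error for rare pairs and caps the weight for very frequent ones.
        weight = min((count_wc / x_max) ** alpha, 1.0)
        err = w @ c + b_w + b_c - np.log(count_wc)
        return weight * err ** 2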

SLIDE 16

Very briefly: Pointwise mutual information (PMI)

• PMI(x, y) = log( P(x, y) / (P(x) · P(y)) ) (a code sketch follows below)
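A small sketch of how a PMI matrix could be computed from a dense word-context co-occurrence count matrix; the function name and numpy-based layout are illustrative assumptions.

    import numpy as np

    def pmi_matrix(counts):
        # counts[i, j] = number of times word i occurred with context j.
        total = counts.sum()
        p_wc = counts / total                              # P(w, c)
        p_w = counts.sum(axis=1, keepdims=True) / total    # P(w)
        p_c = counts.sum(axis=0, keepdims=True) / total    # P(c)
        with np.errstate(divide="ignore"):                 # unseen pairs -> -inf
            return np.log(p_wc / (p_w * p_c))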
SLIDE 17

PMI: example

(Worked example not reproduced in this transcript; source: https://en.wikipedia.org/wiki/Pointwise_mutual_information)

SLIDE 18

PMI matrices for word-context pairs in practice

• Very sparse (a sparse-matrix sketch follows below)
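Because most word-context pairs never co-occur, implementations typically keep only the positive cells (PPMI, as used in [0]) in a sparse matrix. A sketch using scipy; the helper name is an illustrative assumption.

    import numpy as np
    from scipy.sparse import csr_matrix

    def sparse_ppmi(counts):
        # counts: scipy.sparse matrix of word-context co-occurrence counts.
        total = counts.sum()
        w_counts = np.asarray(counts.sum(axis=1)).ravel()  # count(w)
        c_counts = np.asarray(counts.sum(axis=0)).ravel()  # count(c)
        coo = counts.tocoo()
        pmi = np.log(coo.data * total / (w_counts[coo.row] * c_counts[coo.col]))
        keep = pmi > 0                                     # PPMI: drop non-positive cells
        return csr_matrix((pmi[keep], (coo.row[keep], coo.col[keep])),
                          shape=counts.shape)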

SLIDES 19-22

Interesting relationships between PMI and SGNS; PMI and GloVe

• SGNS implicitly factorizes the PMI matrix shifted by a constant [0]. Specifically, SGNS finds vectors w and c such that w · c = PMI(w, c) − log(k), where k is the number of negative samples.
• Recall that in GloVe we learn d-dimensional vectors w and c, as well as word- and context-specific scalars b_w and b_c, such that w · c + b_w + b_c = log(count(w, c)) for all word-context pairs in the dataset.
• If we fix b_w = log(count(w)) and b_c = log(count(c)), we get a problem nearly equivalent to factorizing the PMI matrix shifted by log(|D|), where |D| is the total number of word-context pairs: w · c = PMI(w, c) − log(|D|).
• Or, in simple terms, SGNS (Word2vec) and GloVe aren't too different from PMI (a short derivation follows below).
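To make the last point concrete, substituting the fixed biases into the GloVe objective recovers a shifted PMI cell, using PMI(w, c) = log( count(w, c) · |D| / (count(w) · count(c)) ) with |D| the total number of word-context pairs:

    w · c = log count(w, c) − log count(w) − log count(c)
          = log( count(w, c) · |D| / (count(w) · count(c)) ) − log |D|
          = PMI(w, c) − log |D|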

SLIDES 23-25

Very briefly: SVD of the PMI matrix

• Singular value decomposition of PMI gives us dense vectors.
• Factorize the PMI matrix M into the product of three matrices, U · Σ · Vᵀ.
• Why does that help? Set W = U_d · Σ_d and C = V_d, i.e. keep only the top d singular values to obtain d-dimensional word and context vectors (a code sketch follows below).
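A sketch of obtaining W and C from a sparse (P)PMI matrix via truncated SVD; the use of scipy's svds and the default d = 500 are illustrative choices.

    import numpy as np
    from scipy.sparse.linalg import svds

    def svd_embeddings(pmi, d=500):
        u, s, vt = svds(pmi, k=d)       # top-d singular triplets (ascending order)
        order = np.argsort(-s)          # reorder so the largest values come first
        u, s, vt = u[:, order], s[order], vt[order]
        W = u * s                       # W = U_d · Sigma_d  (word vectors)
        C = vt.T                        # C = V_d            (context vectors)
        return W, C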

SLIDES 26-28

Thesis

• The performance gains of word embeddings are mainly attributable to the optimization of hyperparameters by the algorithm designers, rather than to the algorithms themselves.
• The PMI and SVD baselines used for comparison in embedding papers were the most "vanilla" versions, hence the apparent superiority of the embedding algorithms.
• The hyperparameters of GloVe and Word2vec can be applied to PMI and SVD, drastically improving their performance.

SLIDES 29-32

Pre-processing Hyperparameters

• Dynamic Context Window: context word counts are weighted by their distance to the target word. Word2vec does this by setting each context word's weight to (window_size − distance + 1) / window_size; GloVe uses 1 / distance.
• Subsampling: remove very frequent words from the corpus. Word2vec does this by removing words that are more frequent than some threshold t with probability 1 − sqrt(t / f), where f is the corpus-wide frequency of the word (both this rule and the window weighting are sketched in code after this list).
• Subsampling can be "dirty" or "clean". In dirty subsampling, words are removed before word-context pairs are formed; in clean subsampling, they are removed after.
• Deleting Rare Words: exactly what you would expect. Negligible effect on performance.
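A sketch of the two pre-processing rules above. The function names are illustrative, and the default threshold t = 1e-5 is word2vec's customary value rather than something fixed by the slide.

    import numpy as np

    def dynamic_window_weight(distance, window_size, scheme="word2vec"):
        # word2vec: weight (window_size - distance + 1) / window_size
        # GloVe:    harmonic weight 1 / distance
        if scheme == "word2vec":
            return (window_size - distance + 1) / window_size
        return 1.0 / distance

    def subsample_keep_prob(freq, t=1e-5):
        # A word with corpus-wide frequency f > t is removed with
        # probability 1 - sqrt(t / f), i.e. kept with probability sqrt(t / f).
        return min(1.0, np.sqrt(t / freq))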

SLIDES 33-35

Association Metric Hyperparameters

• Shifted PMI: as previously discussed, SGNS implicitly factorizes the PMI matrix shifted by log(k). When working with PMI matrices directly, we can apply the same transformation by picking some constant k, so that each cell of the matrix is PMI(w, c) − log(k).
• Context Distribution Smoothing: used in Word2vec to smooth the context distribution for negative sampling: P_α(c) = count(c)^α / Σ_c' count(c')^α, where α is some constant. It can be used with PMI in the same sort of way: PMI_α(w, c) = log( P(w, c) / (P(w) · P_α(c)) ), with P_α(c) defined as above.
• Context distribution smoothing helps correct PMI's bias toward word-context pairs where the context is rare (both tweaks are sketched in code after this list).
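A sketch combining both tweaks on a dense count matrix. The defaults k = 5 and alpha = 0.75 are commonly used values, not prescribed by the slide, and the function name is an assumption.

    import numpy as np

    def shifted_smoothed_pmi(counts, k=5, alpha=0.75):
        total = counts.sum()
        p_wc = counts / total                                   # P(w, c)
        p_w = counts.sum(axis=1, keepdims=True) / total         # P(w)
        c_counts = counts.sum(axis=0)
        p_c_alpha = c_counts ** alpha / (c_counts ** alpha).sum()   # smoothed P_alpha(c)
        with np.errstate(divide="ignore"):
            pmi = np.log(p_wc / (p_w * p_c_alpha[np.newaxis, :]))
        return pmi - np.log(k)                                  # shifted PMI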

SLIDES 36-41

Post-processing Hyperparameters

• Adding Context Vectors: the GloVe paper [2] suggests it is useful to return w + c as the word representation rather than just w.
• The new w holds more information. Previously, two vectors would have high cosine similarity if the two words were interchangeable with one another; in the new form, weight is also awarded if one word tends to appear in the context of the other [0].
• Eigenvalue Weighting: in SVD, weight the eigenvalue (singular value) matrix by raising it to some power p, then define W = U_d · Σ_d^p.
• The authors observed that SGNS results in "symmetric" word and context matrices, i.e. neither is orthonormal and no bias is given to either in the training objective [0]. Symmetry is obtainable in SVD by letting p = 0.5.
• Vector Normalization: the general assumption is to normalize word vectors with L2 normalization; it may be worthwhile to experiment with this (these steps are sketched in code after this list).
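A sketch of these post-processing steps applied to an SVD factorization. Function and argument names are illustrative, and it assumes the word and context vocabularies coincide so that w + c is well defined.

    import numpy as np

    def postprocess(u, s, vt, p=0.5, add_context=True, normalize=True):
        W = u * (s ** p)                        # eigenvalue weighting: W = U_d · Sigma_d^p
        C = vt.T                                # context vectors C = V_d
        vecs = W + C if add_context else W      # "adding context vectors": return w + c
        if normalize:                           # L2-normalize each row
            vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs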

SLIDE 42

Experiment Setup: Hyperparameter Space

SLIDES 43-45

Experiment Setup: Training

• Train on an English Wikipedia dump: 77.5 million sentences, 1.5 billion tokens.
• Use d = 500 for SVD, SGNS, and GloVe.
• GloVe trained for 50 iterations.

SLIDES 46-51

Experiment Setup: Testing

• Similarity task: models are evaluated on word similarity tasks over six datasets, each human-labeled with word-pair similarity scores.
• Similarity scores are calculated with cosine similarity.
• Analogy task: two analogy datasets are used, with analogies of the form "a is to a* as b is to b*", where b* is not given. Example: "Paris is to France as Tokyo is to _".
• Analogies are answered using 3CosAdd and 3CosMul (both are sketched in code after this list).
• 3CosAdd: argmax_{b* ∈ V \ {a, a*, b}} ( cos(b*, a*) − cos(b*, a) + cos(b*, b) )
• 3CosMul: argmax_{b* ∈ V \ {a, a*, b}} [ cos(b*, a*) · cos(b*, b) / (cos(b*, a) + ε) ], where ε = 0.001
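A sketch of both analogy solvers over L2-normalized word vectors. Row indices a, a_star, b identify the three given words; the function name and interface are illustrative assumptions.

    import numpy as np

    def solve_analogy(W, a, a_star, b, method="3cosmul", eps=1e-3):
        # W: (vocab, d) matrix of L2-normalized vectors, so W @ v is cosine similarity.
        cos_a, cos_astar, cos_b = W @ W[a], W @ W[a_star], W @ W[b]
        if method == "3cosadd":
            score = cos_astar - cos_a + cos_b
        else:  # 3CosMul (implementations often rescale cosines to be non-negative first)
            score = cos_astar * cos_b / (cos_a + eps)
        score[[a, a_star, b]] = -np.inf        # exclude the three query words
        return int(np.argmax(score))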

SLIDE 52

Experiment Results

SLIDE 53

Experiment Results Cont.

SLIDES 54-59

Key Takeaways

• The average score of SGNS (Word2vec) is lower than SVD's for window sizes 2 and 5. SGNS never outperforms SVD by more than 1.7%.
• Previous results to the contrary were due to comparing a tuned Word2vec against vanilla SVD.
• GloVe is superior to SGNS only on analogy tasks using 3CosAdd (generally considered inferior to 3CosMul).
• CBOW seems to perform well only on the MSR analogy dataset.
• SVD does not benefit from shifting the PMI matrix.
• Using SVD with an eigenvalue weighting of 1 results in poor performance compared to 0.5 or 0.

SLIDES 60-65

Recommendations

• Tune hyperparameters based on the task at hand (duh).
• If you're using PMI, always use context distribution smoothing.
• If you're using SVD, always use eigenvalue weighting.
• SGNS always performs well and is computationally efficient to train and use.
• With SGNS, use many negative examples, i.e. prefer larger k.
• Experiment with the w = w + c variation in SGNS and GloVe.

SLIDE 66

References

• [0] Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.
• [1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
• [2] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.