Analogies Explained: Towards Understanding Word Embeddings — PowerPoint PPT Presentation



SLIDE 1

Analogies Explained

Towards Understanding Word Embeddings

Carl Allen, Tim Hospedales · June 13, 2019

School of Informatics, University of Edinburgh

SLIDE 4

The Problem: linking semantics to geometry

from: “man is to king as woman is to queen”
explain: wking − wman + wwoman ≈ wqueen

or rather:

[Figure: 2-D projection of word embeddings (man, woman, king, queen, royal, crown, reign, lord, prince, princess, …); the point wK − wM + wW lands closest to wqueen.]

SLIDE 8

Word2Vec: SkipGram with Negative Sampling

Mikolov et al. (2013a,b)

[Figure: network diagram with target words w1 … wn (rows of embedding matrix W) and context words c1 … cn (rows of embedding matrix C).]

  • computing p(cj|wi) by softmax is expensive
  • instead, use a sigmoid with k negative samples
  • Levy and Goldberg (2014): at the optimum,

  w⊤i cj ≈ log [ p(wi, cj) / (p(wi) p(cj)) ] − log k = PMI(wi, cj) − log k

i.e. W⊤C ≈ PMI − log k.
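The factorisation above can be checked mechanically. Below is a minimal sketch that builds a PMI matrix from hypothetical co-occurrence counts (invented numbers, not real corpus statistics) and forms the shifted matrix PMI − log k that SGNS implicitly factorises as W⊤C, per Levy and Goldberg (2014):

```python
import numpy as np

# Hypothetical co-occurrence counts between 4 target and 4 context words
# (invented numbers, for illustration only).
counts = np.array([
    [10.0, 2.0, 1.0, 0.5],
    [ 2.0, 8.0, 0.5, 1.0],
    [ 1.0, 0.5, 6.0, 3.0],
    [ 0.5, 1.0, 3.0, 7.0],
])

p_joint = counts / counts.sum()           # p(w_i, c_j)
p_w = p_joint.sum(axis=1, keepdims=True)  # p(w_i)
p_c = p_joint.sum(axis=0, keepdims=True)  # p(c_j)

# PMI(w_i, c_j) = log p(w_i, c_j) / (p(w_i) p(c_j))
pmi = np.log(p_joint / (p_w * p_c))

k = 5  # number of negative samples
# Levy and Goldberg (2014): at the optimum, W.T @ C approximates this
# shifted PMI matrix element-wise.
shifted_pmi = pmi - np.log(k)
print(shifted_pmi.round(3))
```

A trained SGNS model would not recover this matrix exactly (the embedding dimension is a low-rank bottleneck), but it is the quantity the objective targets.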

SLIDE 16

Routemap

semantic:  “man is to king as woman is to queen”
    ⇕
  man transforms to king as woman transforms to queen
    ⇕
  {woman, king} paraphrases {man, queen}
    ⇓
  PMIking − PMIman + PMIwoman ≈ PMIqueen
    ⇓  (using PMIi ≈ w⊤i C)
geometric:  wking − wman + wwoman ≈ wqueen

SLIDE 18

Paraphrase† of W by w∗

Intuition: word w∗ ∈ E paraphrases word set W = {w1, ..., wm} ⊆ E if w∗ and W are semantically interchangeable.

[Figure: the context distributions p(E|w∗) and p(E|W), overlaid over dictionary words w1 … wn.]

Definition (D1): w∗ ∈ E paraphrases W ⊆ E, |W| < l, if the paraphrase error ρW,w∗ ∈ Rn is (element-wise) small:

  (ρW,w∗)j = log [ p(cj|w∗) / p(cj|W) ],   cj ∈ E

†Inspired by Gittens et al. (2017)
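When the conditional distributions are known, the paraphrase error of D1 is a direct computation. A minimal numeric sketch with hypothetical distributions (the values below are invented for illustration):

```python
import numpy as np

# Hypothetical context distributions over a 5-word dictionary E
# (invented values, for illustration only).
p_c_given_wstar = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # p(c_j | w*)
p_c_given_W     = np.array([0.38, 0.27, 0.14, 0.13, 0.08])  # p(c_j | W)

# Element-wise paraphrase error (D1): rho_j = log p(c_j|w*) / p(c_j|W)
rho = np.log(p_c_given_wstar / p_c_given_W)

# w* paraphrases W iff rho is element-wise small.
print(np.abs(rho).max())  # here well below 0.1
```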

SLIDE 24

Summing PMI vectors of a paraphrase

PMI1 + PMI2 ≈ PMI∗ ?

PMI(w∗, cj) − ( PMI(w1, cj) + PMI(w2, cj) )
  = log [ p(w∗|cj) / p(w∗) ] − log [ p(w1|cj) p(w2|cj) / (p(w1) p(w2)) ] + log [ p(W|cj) / p(W|cj) ] + log [ p(W) / p(W) ]
  = log [ p(cj|w∗) / p(cj|W) ]             ← (ρW,w∗)j, the paraphrase error
  + log [ p(W|cj) / (p(w1|cj) p(w2|cj)) ]  ← (σW)j, the conditional independence error
  − log [ p(W) / (p(w1) p(w2)) ]           ← τW, the independence error

Lemma 1: For any word w∗ ∈ E and word set W ⊆ E, |W| < l:

  PMI∗ = Σ wi∈W PMIi + ρW,w∗ + σW − τW·1
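Lemma 1 is an algebraic identity, so it holds for any consistent set of probabilities. The sketch below verifies it numerically with arbitrary hypothetical values; none of the numbers model a real corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # number of context words c_j

# Arbitrary hypothetical probabilities; Lemma 1 is an identity, so any
# positive, mutually consistent values work.
p_c = rng.dirichlet(np.ones(n))              # p(c_j)
p_w1, p_w2, p_wstar, p_W = 0.05, 0.04, 0.03, 0.02
p_w1_c    = rng.uniform(0.01, 0.1, n)        # p(w1 | c_j)
p_w2_c    = rng.uniform(0.01, 0.1, n)        # p(w2 | c_j)
p_wstar_c = rng.uniform(0.01, 0.1, n)        # p(w* | c_j)
p_W_c     = rng.uniform(0.01, 0.1, n)        # p(W  | c_j), W = {w1, w2}

def pmi(p_w_c, p_w):
    # PMI(w, c_j) = log p(w|c_j) / p(w)
    return np.log(p_w_c / p_w)

# Error terms as defined on the slide:
rho   = np.log((p_wstar_c * p_c / p_wstar) / (p_W_c * p_c / p_W))  # paraphrase error
sigma = np.log(p_W_c / (p_w1_c * p_w2_c))    # conditional independence error
tau   = np.log(p_W / (p_w1 * p_w2))          # independence error (scalar)

lhs = pmi(p_wstar_c, p_wstar)                                      # PMI*
rhs = pmi(p_w1_c, p_w1) + pmi(p_w2_c, p_w2) + rho + sigma - tau
print(np.allclose(lhs, rhs))  # prints True: Lemma 1 holds exactly
```

The p(cj) factors inside rho cancel; they are kept only to mirror the definition ρj = log p(cj|w∗)/p(cj|W) via Bayes' rule.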

SLIDE 27

Generalised Paraphrase (of W by W∗)

Lemma 1: For any word w∗ ∈ E and word set W ⊆ E, |W| < l:

  PMI∗ = Σ wi∈W PMIi + ρW,w∗ + σW − τW·1

Replace word w∗ with word set W∗ ⊆ E:

[Figure: the context distributions p(E|W∗) and p(E|W), overlaid over dictionary words w1 … wn.]

Lemma 2: For any word sets W, W∗ ⊆ E, |W|, |W∗| < l:

  Σ wi∈W∗ PMIi = Σ wi∈W PMIi + ρW,W∗ + σW − σW∗ − (τW − τW∗)·1
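Lemma 2 can be verified the same way. A sketch with hypothetical probabilities for W = {w1, w2} and W∗ = {w3, w4} (invented values; only the algebraic relations matter):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6  # number of context words c_j

# Hypothetical probabilities; W = {w1, w2}, W* = {w3, w4}.
names = ["w1", "w2", "w3", "w4", "W", "Wstar"]
p_c_given = {w: rng.uniform(0.01, 0.1, n) for w in names}  # p(word/set | c_j)
p = {"w1": 0.05, "w2": 0.04, "w3": 0.06, "w4": 0.03, "W": 0.02, "Wstar": 0.025}

def pmi(w):
    return np.log(p_c_given[w] / p[w])  # PMI(w, c_j) = log p(w|c_j)/p(w)

def sigma(s, members):  # conditional independence error of set s
    prod = np.ones(n)
    for w in members:
        prod *= p_c_given[w]
    return np.log(p_c_given[s] / prod)

def tau(s, members):    # independence error of set s (scalar)
    prod = 1.0
    for w in members:
        prod *= p[w]
    return np.log(p[s] / prod)

# rho^{W,W*}_j = log p(c_j|W*) / p(c_j|W); the p(c_j) factors cancel.
rho = np.log(p_c_given["Wstar"] / p["Wstar"]) - np.log(p_c_given["W"] / p["W"])

lhs = pmi("w3") + pmi("w4")                              # sum over W*
rhs = (pmi("w1") + pmi("w2") + rho                       # sum over W + errors
       + sigma("W", ["w1", "w2"]) - sigma("Wstar", ["w3", "w4"])
       - (tau("W", ["w1", "w2"]) - tau("Wstar", ["w3", "w4"])))
print(np.allclose(lhs, rhs))  # prints True: Lemma 2 holds exactly
```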

SLIDE 30

Paraphrase: the link from semantics to geometry

Lemma 2: For any word sets W, W∗ ⊆ E, |W|, |W∗| < l:

  Σ wi∈W∗ PMIi = Σ wi∈W PMIi + ρW,W∗ + σW − σW∗ − (τW − τW∗)·1

So, if W = {woman, king} paraphrases W∗ = {man, queen}, then:

  PMIqueen ≈ PMIking − PMIman + PMIwoman + σW − σW∗ − (τW − τW∗)·1

where σW − σW∗ − (τW − τW∗)·1 is the net dependence error.

SLIDE 31

Routemap

“man is to king as woman is to queen”
    ⇕
  man transforms to king as woman transforms to queen
    ⇕
  {woman, king} paraphrases {man, queen}
    ⇓  (up to the dependence error)
  PMIking − PMIman + PMIwoman ≈ PMIqueen
    ⇓  (using PMIi ≈ w⊤i C)
  wking − wman + wwoman ≈ wqueen

SLIDE 35

Word Transformation: a change of perspective

A paraphrase w∗ of W can be thought of as a word transformation from some w ∈ W to w∗ by adding W+ = {wi ∈ W, wi ≠ w}, e.g.

  {man, royal} ≈P king  ⟹  man —(+royal)→ king

Added words contextualise w, such that the induced distribution better aligns with that of w∗.

A paraphrase of W by W∗ can be thought of as a word transformation from some w ∈ W to some w∗ ∈ W∗ by adding to both …

  w, +W+  ≈P  w∗, +W−

… or adding to one side and subtracting from the other:

  w —(+W+, −W−)→ w∗    (word transformation)

SLIDE 38

Word Transformation: a change of perspective (cont.)

  w —(+W+, −W−)→ w∗  where W ≈P W∗    (word transformation)

A generalised paraphrase is a word transformation from w ∈ W to w∗ ∈ W∗, where:

  • added words narrow context
  • subtracted words broaden context

This provides a “richer dictionary” to explain the difference between w and w∗, or rather, how “w is to w∗”.

Definition (D4): We say “wa is to wa∗ as wb is to wb∗” iff there exist W+, W− ⊆ E that simultaneously transform wa to wa∗ and wb to wb∗.

SLIDE 42

Word Transformation: a change of perspective (cont.)

That is, we say: “man is to king as woman is to queen” iff there exist W+, W− ⊆ E that simultaneously transform man to king and woman to queen. Let W+ = {king}, W− = {man}:

  man —(+king, −man)→ king      (word transformation)
  woman —(+king, −man)→ queen   (word transformation)


SLIDE 46

The Solution: linking semantics to geometry

“man is to king as woman is to queen” implies:

  wking − wman + wwoman ≈ wqueen   (up to the ρ, σ, τ error terms)

[Figure: 2-D projection of word embeddings (man, woman, king, queen, royal, crown, reign, lord, prince, princess, …); wK − wM + wW lands near wqueen, offset by the ρ, σ, τ error terms.]
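In practice this is the familiar vector-arithmetic analogy test: find the word nearest to wking − wman + wwoman. A minimal sketch with small hypothetical embeddings; real use would query vectors from a trained model such as word2vec:

```python
import numpy as np

# Hypothetical 4-d embeddings for illustration; real vectors would come
# from a trained model.
emb = {
    "man":    np.array([0.9, 0.1, 0.0, 0.1]),
    "woman":  np.array([0.1, 0.9, 0.0, 0.1]),
    "king":   np.array([0.9, 0.1, 0.8, 0.2]),
    "queen":  np.array([0.1, 0.9, 0.8, 0.2]),
    "prince": np.array([0.8, 0.2, 0.6, 0.1]),
}

def analogy(a, a_star, b, vocab):
    """Return the word whose embedding is closest (by cosine) to
    w_a* - w_a + w_b, excluding the three query words."""
    target = emb[a_star] - emb[a] + emb[b]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vocab if w not in {a, a_star, b}]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("man", "king", "woman", emb))  # prints "queen"
```

Excluding the query words from the candidates matters: with real embeddings, the nearest neighbour of wking − wman + wwoman is often wking itself.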

SLIDE 47

References

Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. Skip-Gram − Zipf + Uniform = Vector Additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 69–76, 2017.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, 2014.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013b.