SLIDE 1

Word Embedding

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2018

Many slides have been adapted from Socher's lectures, CS224d, Stanford, 2017.

slide-2
SLIDE 2

One-hot coding
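The slide itself is a figure; as a minimal sketch (the toy vocabulary below is made up, not from the slides), one-hot coding represents each word as a sparse |V|-dimensional vector with a single 1 at the word's index:

```python
# One-hot coding: each word gets a |V|-dimensional vector with a single 1.
import numpy as np

vocab = ["aardvark", "cat", "on", "sat", "the", "zebra"]   # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0.]
```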

SLIDE 3

Distributed similarity based representations

} Representing a word by means of its neighbors
} “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
} One of the most successful ideas of modern statistical NLP

SLIDE 4

Word embedding

} Store “most” of the important information in a fixed, small number of dimensions: a dense vector
} Usually around 25 – 1000 dimensions
} Embeddings: distributional models with dimensionality reduction, based on prediction

SLIDE 5

How to make neighbors represent words?

} Answer: with a co-occurrence matrix X
} Options: full document vs. windows
} Full word-document co-occurrence matrix
} will give general topics (all sports terms will have similar entries), leading to “Latent Semantic Analysis”
} Window around each word
} captures both syntactic (POS) and semantic information

SLIDE 6

LSA: Dimensionality Reduction based on word-doc matrix

[Figure: SVD of the word-document matrix X (words × docs); embedded word vectors are obtained by maintaining only the k largest singular values]
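A minimal NumPy sketch of this idea (my own illustration, not from the slides): factorize a toy word-document count matrix and keep only the k largest singular values.

```python
# Truncated SVD for LSA-style word embeddings (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(1000, 200)).astype(float)   # toy word-document counts

U, s, Vt = np.linalg.svd(X, full_matrices=False)        # X = U diag(s) Vt

k = 50                                    # keep only the k largest singular values
word_vectors = U[:, :k] * s[:k]           # one k-dimensional embedding per word

print(word_vectors.shape)                 # (1000, 50)
```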

SLIDE 7

Problems with SVD

} Its computational cost scales quadratically for an n x m matrix: O(mn^2) flops (when n < m)
} Bad for millions of words or documents
} Hard to incorporate new words or documents
} Does not consider the order of words in the documents

SLIDE 8

Directly learn low-dimensional word vectors

} Old idea, relevant for this lecture:
} Learning representations by back-propagating errors (Rumelhart et al., 1986)
} NNLM: A neural probabilistic language model (Bengio et al., 2003)
} NLP (almost) from Scratch (Collobert & Weston, 2008)
} A recent, even simpler and faster model: word2vec (Mikolov et al., 2013) -> intro now

SLIDE 9

word2vec

} Key idea: the word vector can predict surrounding words
} word2vec, as originally described (Mikolov et al., 2013), is an NN model using a two-layer network (i.e., not deep!) to perform dimensionality reduction.
} Faster, and can easily incorporate a new sentence/document or add a word to the vocabulary
} Very computationally efficient, good all-round model (good hyper-parameters already selected).

SLIDE 10

Skip-gram vs. CBOW

} Two possible architectures:
} Given some context words, predict the center word (CBOW)
} Predict the center word from the sum of the surrounding word vectors
} Given a center word, predict the context words (Skip-gram)

CBOW uses a window of words to predict the middle word.
Skip-gram uses a word to predict the surrounding words.

SLIDE 11

Continuous Bag of Words: Example

} E.g. “The cat sat on floor”
} Window size = 2

[Figure: the context words “the”, “cat”, “on”, “floor” are used to predict the center word “sat”]

SLIDE 12

Continuous Bag of Words: Example

[Figure: CBOW network. Input layer: one-hot vectors for the context words “cat” and “on” (each with a 1 at that word’s index in the vocabulary). Hidden layer. Output layer: the center word “sat”.]

SLIDE 13

Continuous Bag of Words: Example

[Figure: the same CBOW network with its weight matrices. Each V-dim one-hot input is multiplied by $X_{V \times d}$ to give the d-dim hidden layer, which is multiplied by $X'_{d \times V}$ to give the V-dim output for “sat”.]

d will be the size of the word vectors. We must learn both $X$ and $X'$.

SLIDE 14

Word embedding matrix

} You will get the word vector by left-multiplying a one-hot vector by $X$

$y = (0, \ldots, 1, \ldots, 0)^\top$ with $y_l = 1$
$h = y^\top X = X_{l,:} = w_l$   (the $l$-th row of the matrix $X$)

[Figure: $X$ has one row per vocabulary word: a, aardvark, ..., zebra]

SLIDE 15

Continuous Bag of Words: Example

$\hat{w} = \frac{w_{cat} + w_{on}}{2}$

$X^\top \times y_{on} = w_{on}$

[Figure: multiplying $X^\top$ by the one-hot vector $y_{on}$ selects the column of $X^\top$ for “on”, e.g. $w_{on} = (4.5, 8.4, \ldots, 6.7)$]

SLIDE 16

Continuous Bag of Words: Example

$\hat{w} = \frac{w_{cat} + w_{on}}{2}$

$X^\top \times y_{cat} = w_{cat}$

[Figure: multiplying $X^\top$ by the one-hot vector $y_{cat}$ selects the column of $X^\top$ for “cat”, e.g. $w_{cat} = (1.5, 0.9, \ldots, 1.9)$]

SLIDE 17

Continuous Bag of Words: Example

[Figure: CBOW network with input matrix $X_{V \times d}$ and output matrix $X'_{d \times V}$; d is the size of the word vectors.]

$X'^\top \times \hat{w} = A$
$\hat{z} = \mathrm{softmax}(A)$

SLIDE 18

Continuous Bag of Words: Example

$X'^\top \times \hat{w} = A$
$\hat{z} = \mathrm{softmax}(A)$

[Figure: an example output distribution $\hat{z} = (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, \ldots, 0.00)$]

We would prefer $\hat{z}$ to be close to $z_{sat}$, the one-hot vector of the true center word.

SLIDE 19

Continuous Bag of Words: Example

[Figure: the rows of $X$ (equivalently, the columns of $X^\top$) contain the word vectors; $X'_{d \times V}$ is the output matrix.]

We can consider either $X$ or $X'$ as the word representations, or even take the average.
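Putting slides 12-19 together, here is a small NumPy sketch of one CBOW forward pass (my own illustration with toy dimensions, not the lecture's code); `X` and `Xp` play the roles of $X$ and $X'$ above.

```python
# Toy CBOW forward pass: average the context word vectors, score every
# vocabulary word, and softmax to get a predicted distribution.
import numpy as np

V, d = 6, 3                          # vocabulary size and embedding size
rng = np.random.default_rng(0)
X  = rng.normal(size=(V, d))         # input word matrix (one row per word)
Xp = rng.normal(size=(d, V))         # output word matrix X'

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# context = {"cat", "on"}, center = "sat"; the indices here are arbitrary
y_cat, y_on, sat_idx = np.eye(V)[1], np.eye(V)[2], 3

w_hat = (y_cat @ X + y_on @ X) / 2   # average of the context word vectors
z_hat = softmax(w_hat @ Xp)          # predicted distribution over the vocabulary

print(z_hat[sat_idx])                # training pushes this probability up
```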

SLIDE 20

Skip-gram

} Embeddings that are good at predicting neighboring words are also good at representing similarity

SLIDE 21

[Figure: Skip-gram network. Input layer: the V-dim one-hot vector for the center word “sat”. Hidden layer: $X_{V \times d}$ gives the d-dim word vector $w$. Output layer: $X'_{d \times V}$ produces a V-dim prediction for each context word (“cat”, “on”).]

SLIDE 22

Details of Word2Vec

} Learn to predict surrounding words in a window of length m around every word.
} Objective function: maximize the log probability of any context word given the current center word:

$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(x_{t+j} \mid x_t)$

} Use a large training corpus to maximize it

T: training set size
m: context (window) size, usually 5–10
$x_j$: vector representation of the j-th word
$\theta$: all parameters of the network

SLIDE 23

Skip-gram

} $x_o$: context or output (outside) word
} $x_c$: center or input word

$p(x_o \mid x_c) = \frac{\exp(\mathrm{score}(x_o, x_c))}{\sum_{w} \exp(\mathrm{score}(x_w, x_c))}$

$\mathrm{score}(x_o, x_c) = h^\top X'_{:,o} = w_c \cdot v_o$
$h = y_c^\top X = X_{c,:} = w_c$
$X'_{:,o} = v_o$

Every word has 2 vectors:
} $w_x$: when $x$ is the center word
} $v_x$: when $x$ is the outside word (context word)

$p(x_o \mid x_c) = \frac{\exp(v_o \cdot w_c)}{\sum_{w} \exp(v_w \cdot w_c)}$
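A short NumPy sketch of this softmax (my own illustration; the matrices and indices are toy values, not from the slides):

```python
# Skip-gram softmax p(x_o | x_c): every word has a center vector w and an
# outside vector v; the probability of an outside word is a softmax of dot
# products with the center word's vector.
import numpy as np

V, d = 10, 4
rng = np.random.default_rng(0)
W_center  = rng.normal(size=(V, d))   # w_x: used when x is the center word
V_outside = rng.normal(size=(V, d))   # v_x: used when x is the outside word

def p_outside_given_center(o, c):
    scores = V_outside @ W_center[c]          # v_w . w_c for every word w
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=3, c=7))
```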

SLIDE 24

Details of Word2Vec

} Predict surrounding words in a window of length m around every word:

$p(x_o \mid x_c) = \frac{\exp(v_o \cdot w_c)}{\sum_{w} \exp(v_w \cdot w_c)}$

$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(x_{t+j} \mid x_t)$

SLIDE 25

Parameters

SLIDE 26

Review: Iterative optimization of objective function

} Objective function: $J(\boldsymbol{\theta})$
} Optimization problem: $\boldsymbol{\theta}^\ast = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
} Steps:
} Start from $\boldsymbol{\theta}^0$
} Repeat
} Update $\boldsymbol{\theta}^t$ to $\boldsymbol{\theta}^{t+1}$ in order to increase $J$
} $t \leftarrow t + 1$
} until we hopefully end up at a maximum

SLIDE 27

Review: Gradient ascent

} First-order optimization algorithm to find $\boldsymbol{\theta}^\ast = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$
} Also known as “steepest ascent”
} In each step, takes steps proportional to the gradient vector of the function at the current point $\boldsymbol{\theta}^t$:
} $J(\boldsymbol{\theta})$ increases fastest if one goes from $\boldsymbol{\theta}^t$ in the direction of $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^t)$
} Assumption: $J(\boldsymbol{\theta})$ is defined and differentiable in a neighborhood of the point $\boldsymbol{\theta}^t$

SLIDE 28

Review: Gradient ascent

} Maximize $J(\boldsymbol{\theta})$

$\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^t + \eta\, \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^t)$

$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \left[ \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_1}, \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_2}, \ldots, \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_d} \right]$

} If the step size $\eta$ (the learning rate parameter) is small enough, then $J(\boldsymbol{\theta}^{t+1}) \ge J(\boldsymbol{\theta}^t)$.
} $\eta$ can be allowed to change at every iteration as $\eta^t$.
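A minimal gradient-ascent loop in NumPy (my own toy objective, just to make the update rule concrete):

```python
# Gradient ascent: repeatedly step in the direction of the gradient of J.
import numpy as np

def grad_J(theta):
    # gradient of a toy concave objective J(theta) = -||theta - [1, 2]||^2
    return -2.0 * (theta - np.array([1.0, 2.0]))

theta = np.zeros(2)          # theta^0
eta = 0.1                    # step size (learning rate)
for _ in range(100):
    theta = theta + eta * grad_J(theta)   # theta^{t+1} = theta^t + eta * grad J

print(theta)                 # converges towards the maximizer [1, 2]
```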

SLIDE 29

Gradient

$\frac{\partial \log p(x_o \mid x_c)}{\partial w_c} = \frac{\partial}{\partial w_c} \log \frac{\exp(v_o \cdot w_c)}{\sum_{w} \exp(v_w \cdot w_c)}$

$= \frac{\partial}{\partial w_c} \left[ \log \exp(v_o \cdot w_c) - \log \sum_{w} \exp(v_w \cdot w_c) \right]$

$= v_o - \frac{1}{\sum_{w} \exp(v_w \cdot w_c)} \sum_{w} v_w \exp(v_w \cdot w_c)$

$= v_o - \sum_{w} p(x_w \mid x_c)\, v_w$

SLIDE 30

Training difficulties

} With large vocabularies, it is not scalable!

$\frac{\partial \log p(x_o \mid x_c)}{\partial w_c} = v_o - \sum_{w=1}^{V} p(x_w \mid x_c)\, v_w$

} Define negative prediction that only samples a few words that do not appear in the context
} Similar to focusing on mostly positive correlations

SLIDE 31

Word2vec: Negative Sampling

} Computing $\sum_{w=1}^{V} p(x_w \mid x_c)\, v_w$ is very time consuming.
} Main idea: train a binary classifier for a true pair (the center word and a word in its context window) and a couple of random pairs (the center word with a random word)

SLIDE 32

Negative sampling

} k is the number of negative samples

$\log \sigma(v_o \cdot w_c) + \sum_{j \sim P(x)} \log \sigma(-v_j \cdot w_c)$

} Maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word
} $P(x) = U(x)^{3/4} / Z$
} the unigram distribution U(x) raised to the 3/4 power
} The power makes less frequent words be sampled more often

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
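A NumPy sketch of the negative-sampling objective for a single (center, outside) pair (my own illustration; the counts and vectors are toy values, not from the slides):

```python
# Negative sampling: one positive (center, outside) pair plus k "noise" words
# drawn from the unigram distribution raised to the 3/4 power.
import numpy as np

V, d, k = 10, 4, 5
rng = np.random.default_rng(0)
W_center  = rng.normal(size=(V, d))        # w_x (center-word vectors)
V_outside = rng.normal(size=(V, d))        # v_x (outside-word vectors)

counts = rng.integers(1, 100, size=V)      # toy unigram counts
P = counts ** 0.75
P = P / P.sum()                            # P(x) proportional to U(x)^(3/4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(c, o):
    negatives = rng.choice(V, size=k, p=P)                      # k random words
    obj = np.log(sigmoid(V_outside[o] @ W_center[c]))           # real outside word
    obj += np.log(sigmoid(-V_outside[negatives] @ W_center[c])).sum()
    return obj                                                  # to be maximized

print(neg_sampling_objective(c=7, o=3))
```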

SLIDE 33

Example

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model

SLIDE 34

What to do with the two sets of vectors?

} We end up with U and V from all the vectors u and v (in columns)
} Both capture similar co-occurrence information.
} The best solution is to simply sum them up: X_final = U + V
} One of many hyperparameters explored in GloVe

Pennington et al., Global Vectors for Word Representation, 2014.

SLIDE 35

SLIDE 36

Summary of word2vec

} Go through each word of the whole corpus
} Predict surrounding words of each word
} This captures co-occurrence of words one at a time
} Why not capture co-occurrence counts directly?

SLIDE 37

LSI vs. Skip-gram

Slide by M. Korniyenko, S. Samson. http://www.sfs.uni-tuebingen.de/~ddekok/dl4nlp/glove-presentation.pdf

SLIDE 38

LSI disadvantages

} The co-occurrence matrix changes very often (new words are added).
} The matrix is extremely sparse since most words do not co-occur.
} Quadratic cost to train.
} Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.
} Iteration-based methods solve many of these issues in a far more elegant manner.

Slide by M. Korniyenko, S. Samson. http://www.sfs.uni-tuebingen.de/~ddekok/dl4nlp/glove-presentation.pdf

SLIDE 39

Main Idea of word2vec

} Instead of capturing co-occurrence counts directly, predict surrounding words of every word
} For many tasks, word2vec (skip-gram) outperforms standard count-based vectors.
} But mainly due to the hyperparameters (see Levy et al.)

SLIDE 40

Window based co-occurrence matrix: Example

} Window length 1 (more common: 5 – 10)
} Symmetric (irrelevant whether left or right context)

[Figure: example corpus and the resulting co-occurrence matrix X]
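A small Python sketch of building such a matrix (my own illustration; the toy corpus below is made up, not the slide's):

```python
# Build a symmetric window-based co-occurrence matrix X (window length 1).
import numpy as np
from itertools import chain

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]

vocab = sorted(set(chain.from_iterable(corpus)))
idx = {w: i for i, w in enumerate(vocab)}
window = 1

X = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for t, word in enumerate(sentence):
        lo, hi = max(0, t - window), min(len(sentence), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                X[idx[word], idx[sentence[j]]] += 1   # count left and right contexts

print(vocab)
print(X)
```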

SLIDE 41

More about Word2Vec – relation to LSA

} LSA factorizes a matrix of co-occurrence counts
} (Levy and Goldberg, 2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

SLIDE 42

GloVe

Pennington et al., Global Vectors for Word Representation, 2014.

$J(\theta) = \sum_{i,j} f(Y_{ij}) \left( v_i \cdot w_j - \log Y_{ij} \right)^2$
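A NumPy sketch of this objective (my own illustration; the weighting function f uses the common GloVe choice x_max = 100, α = 3/4, which is an assumption, and bias terms are omitted as in the formula above):

```python
# GloVe objective: weighted least-squares fit of dot products to log co-occurrence counts.
import numpy as np

def f(y, x_max=100.0, alpha=0.75):
    # weighting function: down-weights rare pairs, caps frequent ones at 1
    return np.where(y < x_max, (y / x_max) ** alpha, 1.0)

def glove_objective(Y, V_vec, W_vec):
    """J = sum over nonzero Y_ij of f(Y_ij) * (v_i . w_j - log Y_ij)^2."""
    J = 0.0
    for i, j in zip(*np.nonzero(Y)):
        J += f(Y[i, j]) * (V_vec[i] @ W_vec[j] - np.log(Y[i, j])) ** 2
    return J
```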

SLIDE 43

How to evaluate word vectors?

} Related to general evaluation in NLP: intrinsic vs. extrinsic
} Intrinsic:
} Evaluation on a specific/intermediate subtask
} Fast to compute
} Helps to understand that system
} Not clear if really helpful unless correlation to a real task is established
} Extrinsic:
} Evaluation on a real task
} Can take a long time to compute accuracy
} Unclear if the subsystem is the problem, or its interaction with other subsystems
} If replacing exactly one subsystem with another improves accuracy -> Winning!

SLIDE 44

Intrinsic evaluation: Word analogy tasks

} Performance in completing analogy tasks:
} Analogy queries
} Example: “man is to woman as king is to — ?”

$d^\ast = \arg\max_{x} \frac{(y_b - y_a + y_c)^\top y_x}{\lVert y_b - y_a + y_c \rVert}$

} Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
} Discarding the input words from the search!
} Problem: what if the information is there but not linear?

$y_b - y_a \approx y_d - y_c \;\Rightarrow\; d^\ast = \arg\max_{x} \mathrm{sim}(y_b - y_a,\ y_x - y_c)$
$y_b - y_a + y_c \approx y_d \;\Rightarrow\; d^\ast = \arg\max_{x} \mathrm{sim}(y_b - y_a + y_c,\ y_x)$

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
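A small NumPy sketch of the additive analogy search (my own illustration; `vectors`, `word_to_idx`, and `idx_to_word` are assumed to come from a trained embedding):

```python
# Analogy "a is to b as c is to ?": d* = argmax_x sim(y_b - y_a + y_c, y_x),
# discarding the input words from the search.
import numpy as np

def analogy(a, b, c, vectors, word_to_idx, idx_to_word):
    query = vectors[word_to_idx[b]] - vectors[word_to_idx[a]] + vectors[word_to_idx[c]]
    sims = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query) + 1e-9)
    for w in (a, b, c):                       # discard the input words
        sims[word_to_idx[w]] = -np.inf
    return idx_to_word[int(np.argmax(sims))]

# expected behaviour with good vectors:
# analogy("man", "woman", "king", vectors, word_to_idx, idx_to_word) -> "queen"
```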

SLIDE 45

Word Analogies

The linearity of the skip-gram model makes its vectors more suitable for such linear analogical reasoning

SLIDE 46

Visualizations

SLIDE 47

GloVe Visualizations: Company - CEO

SLIDE 48

GloVe Visualizations: Superlatives

SLIDE 49

Other fun word2vec analogies

SLIDE 50

Analogy evaluation and hyperparameters

Pennington et al., Global Vectors for Word Representation, 2014.

SLIDE 51

Analogy evaluation and hyperparameters

} More data helps; Wikipedia is better than news text!

Pennington et al., Global Vectors for Word Representation, 2014.

SLIDE 52

Another intrinsic word vector evaluation

SLIDE 53

Closest words to “Sweden” (cosine similarity)

SLIDE 54

Extrinsic word vector evaluation

} Extrinsic evaluation of word vectors: all subsequent NLP tasks can be considered as downstream tasks
} One example where good word vectors should help directly: text classification

SLIDE 55

Word vectors: advantages

} They capture both syntactic (POS) and semantic information
} They scale
} Train on billion-word corpora in limited time
} Can easily incorporate a new sentence/document or add a word to the vocabulary
} Word embeddings trained by one group can be used by others.
} There is a nice Python module for word2vec
} Gensim (word2vec: http://radimrehurek.com/2014/02/word2vec-tutorial/)
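A minimal Gensim usage example (a sketch; the toy sentences and parameter values are my own choices, and the API shown is Gensim 4.x):

```python
# Train a small word2vec model with Gensim and query similar words.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "floor"],
             ["the", "dog", "sat", "on", "the", "mat"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("cat", topn=3))
```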

SLIDE 56

Resources

} Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
} Pennington et al., Global Vectors for Word Representation, 2014.