Word Embedding
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2018
Many slides have been adopted from Socher's lectures, cs224d, Stanford, 2017.

} Word representations: one-hot coding vs. distributed similarity-based embeddings
} LSA: apply SVD to the words × docs matrix to obtain embedded word vectors
} Maintaining only the k largest singular values
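A minimal numpy sketch of this idea; the toy count matrix, the vocabulary labels, and the choice k = 2 are all illustrative, not from the slides:

```python
import numpy as np

# Toy term-document count matrix: rows = words, columns = docs.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "cat"
    [1.0, 0.0, 2.0, 0.0],   # "dog"
    [0.0, 3.0, 0.0, 1.0],   # "stock"
    [0.0, 1.0, 0.0, 2.0],   # "market"
])

# Full SVD: X = U @ diag(s) @ Vt, with singular values sorted descending.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values.
k = 2
word_vecs = U[:, :k] * s[:k]   # one k-dimensional vector per word

print(word_vecs.shape)  # (4, 2)
```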
} Word2vec comes in two flavors:
} Continuous Bag of Words (CBOW): uses a window of words to predict the middle word, from the sum of the surrounding word vectors
} Skip-gram: uses a word to predict the surrounding words
} Example: "the cat sat on the floor" — predict "sat" from context words such as "the", "cat", "floor"
[Figure: CBOW network — each context word (e.g., the index of "cat" in the vocabulary) enters the input layer as a one-hot vector; the hidden layer combines the context vectors; the output layer predicts the center word "sat"]
[Figure: CBOW network with weights — each V-dim one-hot input is multiplied by the same matrix W (V×N) to give the N-dim hidden layer; the hidden layer is multiplied by W′ (N×V) to give the V-dim output]
} N will be the size of the word vectors
} We must learn W and W′
} Each row of W is the vector of one vocabulary word: W_k,· = w_k
[Figure: the rows of W indexed by the vocabulary — a, Aardvark, …, zebra]
[Figure: multiplying Wᵀ by the one-hot vector x_cat selects the column of Wᵀ corresponding to "cat", e.g., (4.5, 8.4, …, 6.7)]
Wᵀ × x_cat = w_cat
} The hidden layer is the average of the context word vectors: ŵ = (w_cat + w_the) / 2
[Figure: likewise, Wᵀ × x_the selects the column corresponding to "the", e.g., (1.5, 0.9, …, 1.9)]
Wᵀ × x_the = w_the
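A tiny numpy illustration of these two steps; the vocabulary, indices, and random weights are made up for the sketch:

```python
import numpy as np

V, N = 5, 3                     # vocabulary size, embedding size
rng = np.random.default_rng(0)
W = rng.random((V, N))          # input weight matrix, one row per word

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "floor": 4}

def one_hot(idx, size=V):
    x = np.zeros(size)
    x[idx] = 1.0
    return x

x_cat = one_hot(vocab["cat"])
x_the = one_hot(vocab["the"])

# Multiplying W.T by a one-hot vector just selects that word's row of W.
w_cat = W.T @ x_cat
assert np.allclose(w_cat, W[vocab["cat"]])

# CBOW hidden layer: average of the context word vectors.
h = (W.T @ x_cat + W.T @ x_the) / 2
```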
[Figure: CBOW output layer — the hidden vector ŵ (N-dim) is multiplied by W′ᵀ to give one score per vocabulary word (V-dim)]
W′ᵀ × ŵ = z
ŷ = softmax(z)
[Figure: example softmax output ŷ = (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, …, 0.00)]
} We would prefer ŷ to be close to y_sat (the one-hot vector of the true center word)
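Continuing the sketch, the output step looks like this in numpy; the random W′, hidden vector, and the target index are illustrative:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(1)
W_prime = rng.random((N, V))    # output weight matrix W'
h = rng.random(N)               # hidden vector (average of context vectors)

# Scores over the vocabulary, then softmax to get a distribution y_hat.
z = W_prime.T @ h
y_hat = np.exp(z - z.max())     # subtract max for numerical stability
y_hat /= y_hat.sum()

# Training pushes y_hat toward the one-hot target y_sat,
# i.e., it minimizes the cross-entropy -log y_hat[index of "sat"].
sat_idx = 2
loss = -np.log(y_hat[sat_idx])
```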
[Figure: both W and W′ contain one vector per vocabulary word]
} We can consider either W or W′ as the word representations, or even take their average.
[Figure: Skip-gram network — the center word (e.g., "sat") enters as a V-dim one-hot vector and is multiplied by W (V×N) to give the N-dim hidden layer w_t; the same W′ (N×V) maps the hidden layer to a V-dim output for each context position (e.g., "cat")]
} Learn to predict surrounding words in a window of length m around every word.
} Objective function: maximize the log probability of any context word given the current center word:
J(θ) = (1/T) Σ_{t=1..T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)
} Use a large training corpus to maximize it.
  T: training set size; m: context size
  w_t: vector representation of the t-th word
  θ: all parameters of the network
} Softmax probability of an outside word o given the center word c:
p(o | c) = exp(v_oᵀ w_c) / Σ_{k=1..V} exp(v_kᵀ w_c)
} Rows of W hold the center-word vectors: W_k,· = w_k
} Columns of W′ hold the outside-word vectors: W′_·,k = v_k
} Every word has 2 vectors:
  w_x: when x is the center word
  v_x: when x is the outside word (context word)
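A numpy sketch of this probability and of the objective J(θ) on a toy corpus; the random vectors, the corpus of word indices, and the window size m = 2 are all illustrative:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(0)
W = rng.random((V, N))      # center-word vectors w_k (rows)
Wp = rng.random((V, N))     # outside-word vectors v_k (rows)

def log_p(o, c):
    """log p(o | c) under the skip-gram softmax."""
    scores = Wp @ W[c]                       # v_k^T w_c for all k
    return scores[o] - np.logaddexp.reduce(scores)

# Objective on a toy corpus: average log-prob of each context word
# given its center word, for a window of size m.
corpus = [0, 1, 2, 3, 0, 4]
m = 2
J = 0.0
for t, c in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            J += log_p(corpus[t + j], c)
J /= len(corpus)
```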
} Gradient ascent: update θ_t to θ_{t+1} in order to increase J:
θ_{t+1} = θ_t + η ∇_θ J(θ_t)
} t ← t + 1; repeat until we hopefully end up at a maximum
} Also known as "steepest ascent"
} In each step, takes steps proportional to the gradient vector of J at the current point:
∇_θ J(θ) = [∂J(θ)/∂θ_1, ∂J(θ)/∂θ_2, …, ∂J(θ)/∂θ_d]
} If the step size η is small enough, then J(θ_{t+1}) ≥ J(θ_t).
} η can be allowed to change at every iteration as η_t. (Step size / learning rate parameter)
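A minimal gradient-ascent loop on a toy concave function; the function, the step size η = 0.1, and the iteration count are illustrative:

```python
import numpy as np

# Toy concave objective J(theta) = -||theta - a||^2, maximized at theta = a.
a = np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    return -2.0 * (theta - a)   # gradient of J, points uphill

theta = np.zeros(3)
eta = 0.1                        # step size (learning rate)
for t in range(100):
    theta = theta + eta * grad_J(theta)   # step along the gradient

print(theta)   # ≈ a, the maximizer
```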
} Expanding the log of the softmax:
log p(o | c) = v_oᵀ w_c − log Σ_{k=1..V} exp(v_kᵀ w_c)
} Its gradient with respect to the center vector is the observed context vector minus the expected context vector:
∂/∂w_c log p(o | c) = v_o − Σ_{k=1..V} p(k | c) v_k
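The same gradient in numpy; the random vectors and the pair (o, c) are illustrative:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(0)
W = rng.random((V, N))    # center vectors w_k
Wp = rng.random((V, N))   # outside vectors v_k
o, c = 3, 1               # outside word o, center word c

scores = Wp @ W[c]
p = np.exp(scores - scores.max())
p /= p.sum()                        # p(k | c) for all k

# Gradient of log p(o|c) w.r.t. w_c:
# observed context vector minus the expected context vector.
grad = Wp[o] - p @ Wp
```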
} The softmax denominator sums over the whole vocabulary, which is expensive for large V.
} Negative sampling: instead of the full softmax, contrast each real (center, context) pair with a few "negative" words drawn from the unigram distribution raised to the 3/4 power:
P(w_i) = f(w_i)^{3/4} / Σ_j f(w_j)^{3/4}
Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model
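A sketch of the negative-sampling loss from Mikolov et al. (2013); the toy frequencies, the helper names (`ns_loss`, `P_neg`), and K = 2 negatives per pair are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, K = 5, 3, 2                 # vocab size, dim, negatives per pair
W = rng.random((V, N))            # center vectors
Wp = rng.random((V, N))           # outside vectors

# Noise distribution: unigram frequencies raised to the 3/4 power.
freq = np.array([10.0, 5.0, 3.0, 1.0, 1.0])
P_neg = freq ** 0.75
P_neg /= P_neg.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(c, o):
    """Negative-sampling loss for one (center, context) pair."""
    neg = rng.choice(V, size=K, p=P_neg)        # sample K noise words
    loss = -np.log(sigmoid(Wp[o] @ W[c]))       # real pair: score high
    loss -= np.log(sigmoid(-(Wp[neg] @ W[c]))).sum()   # noise: score low
    return loss
```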
Pennington et al., GloVe: Global Vectors for Word Representation, 2014.
Slide by M. Korniyenko, S. Samson: http://www.sfs.uni-tuebingen.de/~ddekok/dl4nlp/glove-presentation.pdf
} Window-based co-occurrence matrix:
  Window length 1 (more common: 5–10)
  Symmetric (irrelevant whether left or right context)
[Table: co-occurrence counts for an example corpus]
} GloVe objective (Pennington et al., GloVe: Global Vectors for Word Representation, 2014):
J = Σ_{i,j=1..V} f(X_ij) (w_iᵀ w̃_j + b_i + b̃_j − log X_ij)²
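A numpy sketch of this objective; the toy co-occurrence counts are random, and the weighting function uses the paper's x_max = 100 and α = 3/4:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3
X = rng.integers(0, 50, size=(V, V)).astype(float)  # toy co-occurrence counts
W = rng.random((V, N))        # word vectors w_i
Wt = rng.random((V, N))       # context vectors w~_j
b = np.zeros(V)               # word biases
bt = np.zeros(V)              # context biases

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: caps the influence of very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Weighted least-squares objective over all co-occurring pairs.
J = 0.0
for i in range(V):
    for j in range(V):
        if X[i, j] > 0:
            err = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
            J += f(X[i, j]) * err ** 2
```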
} Analogy queries
} Example: "man is to woman as king is to — ?"
Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
x_b − x_a ≈ x_d − x_c  ⟹  d* = argmax_x sim(x_b − x_a, x_x − x_c)
x_b − x_a + x_c ≈ x_d  ⟹  d* = argmax_x sim(x_b − x_a + x_c, x_x)
} The linearity of the skip-gram model makes its vectors more suitable for such linear analogical reasoning.
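The second query form in numpy, using cosine similarity; the four toy vectors are hand-picked for illustration, not trained embeddings:

```python
import numpy as np

# Toy embeddings (illustrative values only).
emb = {
    "man":   np.array([0.9, 0.1, 0.2]),
    "woman": np.array([0.9, 0.1, 0.8]),
    "king":  np.array([0.1, 0.9, 0.2]),
    "queen": np.array([0.1, 0.9, 0.8]),
}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "man is to woman as king is to ?": argmax_x sim(x_b - x_a + x_c, x_x),
# excluding the three query words themselves.
query = emb["woman"] - emb["man"] + emb["king"]
candidates = {w: v for w, v in emb.items() if w not in ("man", "woman", "king")}
answer = max(candidates, key=lambda w: cos(query, candidates[w]))
print(answer)   # queen
```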
[Figures: evaluation results from Pennington et al., GloVe: Global Vectors for Word Representation, 2014]