Institute of Computational Perception
Natural Language Processing with Deep Learning Word Embeddings
Navid Rekab-Saz
navid.rekabsaz@jku.at
Agenda
§ Introduction
§ Count-based word representation
§ Prediction-based word embedding
(Figure: a word embedding model maps each vocabulary word $w_1, w_2, \dots, w_N$ to a dense vector.)
When vector representations are dense, they are often called embeddings, e.g., word embeddings.
Word embeddings projected to a two-dimensional space
Example groups of nearby words:
§ frogs, toad, litoria, leptodactylidae, rana
§ books, foreword, author, published, preface
§ bronchitis, allergy, allergies, arthritis, diabetes
https://nlp.stanford.edu/projects/glove/
J.R. Firth, A synopsis of linguistic theory 1930–1955 (1957)
Nida [1975]
Algorithmic intuition:
§ $D$ is a set of documents (plays of Shakespeare): $D = [d_1, d_2, \dots, d_M]$
§ $V$ is the set of words (vocabulary) in the dictionary: $V = [w_1, w_2, \dots, w_N]$
§ Words as rows and documents as columns
§ Values: term count $\mathrm{tc}_{w,d}$
§ Matrix size $N \times M$

Columns: $d_1$ = As You Like It, $d_2$ = Twelfth Night, $d_3$ = Julius Caesar, $d_4$ = Henry V

              d_1    d_2    d_3    d_4
  battle        1      1      8     15
  soldier       2      2     12     36
  fool         37     58      1      5
  clown         6    117      …      …
  …             …      …      …      …
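A minimal NumPy sketch of this term-document matrix; the two missing clown counts are placeholders, since the slide elides them:

```python
import numpy as np

# Rows: words, columns: plays (documents); values are term counts tc_{w,d}.
words = ["battle", "soldier", "fool", "clown"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]

# Term-document matrix X of size N x M. The last two clown counts are
# placeholders (0), not counts from the original data.
X = np.array([
    [ 1,   1,  8, 15],   # battle
    [ 2,   2, 12, 36],   # soldier
    [37,  58,  1,  5],   # fool
    [ 6, 117,  0,  0],   # clown (last two counts elided on the slide)
])

# Each row is a sparse vector representation of a word over documents.
print(dict(zip(docs, X[words.index("fool")])))
```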
$\cos(\boldsymbol{a}, \boldsymbol{b}) = \dfrac{\boldsymbol{a} \cdot \boldsymbol{b}}{\lVert\boldsymbol{a}\rVert \, \lVert\boldsymbol{b}\rVert}$

§ $\boldsymbol{a} = [1\;\ 1\;\ 0]$, $\boldsymbol{b} = [4\;\ 5\;\ 6]$

$\cos(\boldsymbol{a}, \boldsymbol{b}) = \dfrac{1 \cdot 4 + 1 \cdot 5 + 0 \cdot 6}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{4^2 + 5^2 + 6^2}} = \dfrac{9}{{\approx}12.4} \approx 0.73$
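The same worked example as a short NumPy sketch (the function name `cosine` is ours):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product scaled by the two vector norms."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 1.0, 0.0])
b = np.array([4.0, 5.0, 6.0])
print(cosine(a, b))  # 9 / ~12.4 ~= 0.73
```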
Each word is represented as a vector over documents: $\boldsymbol{w} = [d_1, d_2, \dots, d_M]$
sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of
their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
well suited to programming on the digital computer. In finding the optimal R-stage policy from
for the purpose of gathering data and information necessary for the study authorized in the
Columns: $c_1$ = aardvark, $c_2$ = computer, $c_3$ = data, $c_4$ = pinch, $c_5$ = result, $c_6$ = sugar

                        c_1   c_2   c_3   c_4   c_5   c_6
  w_1  apricot            0     0     0     1     0     1
  w_2  pineapple          0     0     0     1     0     1
  w_3  digital            0     2     1     0     1     0
  w_4  information        0     1     6     0     4     0
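A minimal sketch of how such word-context counts can be collected; the ±7 window size is an assumption chosen so the first snippet reproduces the apricot row, since the slide does not fix the window:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=7):
    """Count (word, context) pairs within a symmetric window of +/- `window`."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "sugar a sliced lemon a tablespoonful of apricot preserve or jam a pinch each of".split()
counts = cooccurrence_counts(tokens)
print(counts[("apricot", "pinch")], counts[("apricot", "sugar")])  # 1 1
```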
§ Raw counts of very frequent words (e.g. the, it, they) don't contain much information
§ Solution: re-weight the matrix in order to reflect the informativeness of co-occurrences
§ PMI measures how much more often a word and a context co-occur than expected from their marginal probabilities
$P(w, c) = \dfrac{\#(w, c)}{T}$

$P(w) = \dfrac{\sum_{j=1}^{M} \#(w, c_j)}{T}$

$P(c) = \dfrac{\sum_{i=1}^{N} \#(w_i, c)}{T}$

$T = \sum_{i=1}^{N} \sum_{j=1}^{M} \#(w_i, c_j)$
$P(w{=}\text{information}, c{=}\text{data}) = \frac{6}{19} = .32$

$P(w{=}\text{information}) = \frac{11}{19} = .58$

$P(c{=}\text{data}) = \frac{7}{19} = .37$

$\mathrm{PPMI}(w{=}\text{information}, c{=}\text{data}) = \max\!\left(0, \log\frac{.32}{.58 \cdot .37}\right) = .39$
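A sketch that computes PPMI from the count table above and reproduces the worked example (natural log, matching the .39 on the slide):

```python
import numpy as np

# Word-context counts (rows: apricot, pineapple, digital, information;
# columns: aardvark, computer, data, pinch, result, sugar).
C = np.array([
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 2, 1, 0, 1, 0],
    [0, 1, 6, 0, 4, 0],
], dtype=float)

T = C.sum()                 # total number of co-occurrences (19)
P_wc = C / T                # joint probabilities P(w, c)
P_w = C.sum(axis=1) / T     # marginals P(w)
P_c = C.sum(axis=0) / T     # marginals P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(P_wc / np.outer(P_w, P_c))
# Clip negative and undefined (zero-count) entries to 0.
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

print(round(ppmi[3, 2], 2))  # PPMI(information, data) = 0.39
```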
* The definition of SVD is simplified. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition
$\underset{N \times M}{\boldsymbol{X}} = \underset{N \times N}{\boldsymbol{U}} \; \underset{N \times M}{\boldsymbol{\Sigma}} \; \underset{M \times M}{\boldsymbol{V}^{T}}$

§ $\boldsymbol{X}$: (sparse) word-context matrix (words × contexts)
§ $\boldsymbol{U}$: left singular vectors → word vectors
§ $\boldsymbol{\Sigma}$: eigenvalues
§ $\boldsymbol{V}^{T}$: right singular vectors → context vectors
$\underset{N \times M}{\boldsymbol{X}} \approx \underset{N \times d}{\widetilde{\boldsymbol{U}}} \; \underset{d \times d}{\widetilde{\boldsymbol{\Sigma}}} \; \underset{d \times M}{\widetilde{\boldsymbol{V}}^{T}}$

§ $\widetilde{\boldsymbol{U}}$: truncated word vectors
§ $\widetilde{\boldsymbol{\Sigma}}$: truncated eigenvalues
§ $\widetilde{\boldsymbol{V}}^{T}$: truncated context word vectors
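A sketch of truncated SVD on a random stand-in matrix; taking rows of $\widetilde{\boldsymbol{U}}\widetilde{\boldsymbol{\Sigma}}$ as word vectors is one common choice, not prescribed by the slide:

```python
import numpy as np

# Toy (sparse) word-context matrix X of size N x M, a stand-in for PPMI values.
rng = np.random.default_rng(0)
X = np.maximum(0, rng.standard_normal((8, 12)))

# Full SVD: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-d singular values -> dense d-dimensional word vectors.
d = 3
word_vectors = U[:, :d] * S[:d]   # rows: words (one common choice is U or U*S)
context_vectors = Vt[:d, :].T     # rows: contexts

# Rank-d reconstruction; the error shrinks as d grows.
X_hat = word_vectors @ context_vectors.T
print(X.shape, word_vectors.shape, np.linalg.norm(X - X_hat))
```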
Recipe for creating (dense) word embeddings with neural networks. Details come next!
Window of size 2
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://web.stanford.edu/~jurafsky/slp3/
Training sample: (tesgüino, drink)
(Figure: skip-gram network for the training sample (tesgüino, drink))
§ Input layer: one-hot encoding of the input word, size $1 \times N$
§ Encoder embedding matrix, size $N \times d$; hidden layer ($1 \times d$) with linear activation
§ Decoder embedding matrix, size $d \times N$
§ Output layer (softmax), size $1 \times N$: $P(\text{drink} \mid \text{tesgüino})$
§ Training: forward pass and backpropagation
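A minimal NumPy sketch of this forward pass; the matrix names `E` and `D` and all sizes are illustrative, and real word2vec trains on sampled windows rather than explicit one-hot vectors:

```python
import numpy as np

N, d = 10_000, 300                 # vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, (N, d))     # encoder (embedding) matrix
D = rng.normal(0, 0.1, (d, N))     # decoder matrix

def forward(word_id):
    """One skip-gram forward pass: one-hot input -> linear hidden -> softmax."""
    h = E[word_id]                 # equivalent to one_hot @ E (linear activation)
    scores = h @ D                 # 1 x N scores over all context words
    scores -= scores.max()         # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()             # P(context | word) via softmax

p = forward(42)                    # e.g. P(. | tesgüino) if id 42 were tesgüino
print(p.shape, p.sum())            # (10000,) 1.0
```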
(Figures, step by step: the embedding vectors of ale and tesgüino are rows of the encoder matrix; the decoding vector of drink is a column of the decoder matrix.)
$P(c \mid w) = \dfrac{\exp(\boldsymbol{e}_w \cdot \tilde{\boldsymbol{e}}_c)}{\sum_{c' \in V} \exp(\boldsymbol{e}_w \cdot \tilde{\boldsymbol{e}}_{c'})}$

$P(\text{drink} \mid \text{tesgüino}) = \dfrac{\exp(\boldsymbol{e}_{\text{tesgüino}} \cdot \tilde{\boldsymbol{e}}_{\text{drink}})}{\sum_{c' \in V} \exp(\boldsymbol{e}_{\text{tesgüino}} \cdot \tilde{\boldsymbol{e}}_{c'})}$

§ Denominator (normalization) is expensive!
$\mathcal{L} = -\log P(c \mid w) = -\,\boldsymbol{e}_w \cdot \tilde{\boldsymbol{e}}_c + \log \sum_{c' \in V} \exp(\boldsymbol{e}_w \cdot \tilde{\boldsymbol{e}}_{c'})$
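A sketch of this loss that makes the expensive denominator visible (the function name `nll` is ours; the log-sum-exp runs over the entire vocabulary for every single training pair):

```python
import numpy as np

def nll(E, D, word_id, context_id):
    """Negative log-likelihood of one (word, context) pair under full softmax."""
    h = E[word_id]
    scores = h @ D                         # O(N * d) work per training pair!
    log_Z = np.logaddexp.reduce(scores)    # log of the softmax denominator
    return -(scores[context_id] - log_Z)

rng = np.random.default_rng(0)
E, D = rng.normal(size=(50, 8)), rng.normal(size=(8, 50))
print(nll(E, D, word_id=3, context_id=7))
```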
§ The negative samples represent a randomly created corpus → words co-occur randomly!
§ From which distribution do we sample?
§ The unigram distribution is smoothed by raising unigram counts to the power of $0.75$ → Context Distribution Smoothing
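A sketch of the smoothed sampling distribution (toy counts; `sample_negatives` is a hypothetical helper):

```python
import numpy as np

# Toy unigram counts; frequent words would otherwise dominate the samples.
counts = np.array([100, 50, 10, 5, 1], dtype=float)

p_unigram = counts / counts.sum()
p_smoothed = counts**0.75 / (counts**0.75).sum()  # context distribution smoothing

print(np.round(p_unigram, 3))    # rare words are almost never sampled
print(np.round(p_smoothed, 3))   # rare words get a somewhat larger share

def sample_negatives(k, rng=np.random.default_rng(0)):
    """Draw k negative context word ids from the smoothed unigram distribution."""
    return rng.choice(len(counts), size=k, p=p_smoothed)

print(sample_negatives(5))
```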
§ The objective function: make observed (positive) word-context pairs likely and randomly sampled (negative) pairs unlikely
§ Loss function ($\sigma$: sigmoid function):

$\mathcal{L} = -\log \sigma(\boldsymbol{e}_w \cdot \tilde{\boldsymbol{e}}_c) \;-\; \sum_{i=1}^{k} \mathbb{E}_{\bar{c}_i \sim P_n}\!\left[ \log \sigma(-\boldsymbol{e}_w \cdot \tilde{\boldsymbol{e}}_{\bar{c}_i}) \right]$

(first term: positive sample; second term: $k$ negative samples)
(Figures, step by step: for the positive pair (tesgüino, drink), the embedding vector of tesgüino and the decoding vectors of drink and of the negative samples.)
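A sketch of this loss for one positive pair and $k$ negatives (the vectors are random stand-ins); note that only $k+1$ dot products are needed, with no normalization over the whole vocabulary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(e_w, d_pos, d_negs):
    """Skip-gram negative-sampling loss:
    -log sigmoid(e_w . d_pos) - sum_i log sigmoid(-e_w . d_neg_i)."""
    pos = -np.log(sigmoid(e_w @ d_pos))
    neg = -np.log(sigmoid(-(d_negs @ e_w))).sum()
    return pos + neg

rng = np.random.default_rng(0)
e_w = rng.normal(size=8)           # embedding vector of e.g. tesgüino
d_pos = rng.normal(size=8)         # decoding vector of the observed context (drink)
d_negs = rng.normal(size=(5, 8))   # decoding vectors of k=5 negative samples
print(sgns_loss(e_w, d_pos, d_negs))
```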
Noise Contrastive Estimation (NCE) is an unbiased approximation of softmax.