Neural Network Language Models
Philipp Koehn 1 October 2020
N-Gram Backoff Language Model
Previously, we approximated the probability of a word sequence

p(W) = p(w_1, w_2, ..., w_n)

by decomposing it with the chain rule

p(W) = ∏_i p(w_i | w_1, ..., w_{i−1})

and limiting the conditioning history to the previous four words (a Markov assumption):

p(w_i | w_1, ..., w_{i−1}) ≃ p(w_i | w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1})
→ if such a 5-gram has not been observed, we back off to p(w_i | w_{i−3}, w_{i−2}, w_{i−1}), then p(w_i | w_{i−2}, w_{i−1}), etc., all the way to p(w_i)
– the exact details of backing off get complicated: "interpolated Kneser-Ney"
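As a concrete illustration, here is a minimal backoff sketch in code. It implements the simpler "stupid backoff" scheme rather than interpolated Kneser-Ney, and the function names (count_ngrams, backoff_prob) are hypothetical:

```python
from collections import defaultdict

def count_ngrams(corpus, max_n=4):
    """Count all n-grams up to length max_n in a list of token lists."""
    counts = defaultdict(int)
    for sent in corpus:
        tokens = ["<s>"] * (max_n - 1) + sent + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(word, history, counts, discount=0.4):
    """Stupid backoff: use the longest history that has been observed
    with the word, applying a fixed discount per backoff step."""
    penalty = 1.0
    while history:
        hist, full = tuple(history), tuple(history) + (word,)
        if counts[full] > 0:
            return penalty * counts[full] / counts[hist]
        history = history[1:]        # back off: drop the oldest word
        penalty *= discount
    unigram_total = sum(c for ng, c in counts.items() if len(ng) == 1)
    return penalty * counts[(word,)] / unigram_total

counts = count_ngrams([["the", "cat", "eats"], ["the", "dog", "eats"]])
print(backoff_prob("eats", ["the", "cat"], counts))   # 1.0: 3-gram observed
```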
⇒ We are wrestling here with
– using as much relevant evidence as possible
– pooling evidence between words
[Figure: first sketch of a neural language model, with a feed-forward (FF) hidden layer and a softmax output over the vocabulary]
Each word is represented by a one-hot vector, e.g.,
– dog = (0,0,0,0,1,0,0,0,0,...)
– cat = (0,0,0,0,0,0,0,1,0,...)
– eat = (0,1,0,0,0,0,0,0,0,...)
These vectors are as long as the vocabulary, so in practice we either
– limit the vocabulary to, say, the 20,000 most frequent words; the rest are OTHER
– or place words in √n classes, so each word is represented by
  ∗ 1 class label
  ∗ 1 word-in-class label
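A minimal sketch of the one-hot representation, assuming a made-up toy vocabulary (the dictionary vocab and its index choices are hypothetical); unknown words fall into a final OTHER slot:

```python
import numpy as np

# Hypothetical toy vocabulary; the last index serves as OTHER
vocab = {"the": 0, "eat": 1, "dog": 4, "cat": 7}

def one_hot(word, vocab_size=10):
    """Return the 1-hot vector for a word: all zeros, a single one."""
    v = np.zeros(vocab_size)
    v[vocab.get(word, vocab_size - 1)] = 1.0   # unknown words -> OTHER
    return v

print(one_hot("dog"))   # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
```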
A simple way to build such word classes:
– sort words by frequency
– place them in order into classes
– each class has the same token count
→ very frequent words have their own class
→ rare words share a class with many other words
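The scheme above is straightforward to code; a sketch under the assumption of √n classes with equal token budgets (the function name frequency_classes is hypothetical):

```python
import math
from collections import Counter

def frequency_classes(tokens, n_classes=None):
    """Assign words to classes with roughly equal token counts:
    frequent words end up in (almost) private classes, rare words
    share a class with many others."""
    freq = Counter(tokens)
    if n_classes is None:
        n_classes = max(1, int(math.sqrt(len(freq))))  # ~sqrt(n) classes
    budget = sum(freq.values()) / n_classes            # tokens per class
    classes, current, filled = {}, 0, 0
    for word, count in freq.most_common():             # frequency order
        classes[word] = current
        filled += count
        if filled >= budget and current < n_classes - 1:
            current, filled = current + 1, 0
    return classes

print(frequency_classes("the the the cat dog eats dog the cat".split()))
```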
[Figure: full feed-forward neural language model: each history word is mapped through an embedding layer (Embed), the concatenated embeddings feed a feed-forward (FF) hidden layer, and a softmax predicts the output word]
The activation functions at each layer:
– input → embedding: none (linear)
– embedding → hidden: tanh
– hidden → output: softmax
Training:
– loop through the entire corpus
– update weights based on the difference between the predicted probabilities and the 1-hot vector for the output word
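A minimal numpy sketch of the forward pass just described; all sizes and parameter names (C, W1, W2) are hypothetical, and training would add the cross-entropy update loop:

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H, N = 1000, 32, 64, 4         # vocab, embedding, hidden, history

C = rng.normal(0, 0.1, (V, E))       # embedding matrix (no activation)
W1 = rng.normal(0, 0.1, (N * E, H))  # embedding -> hidden
W2 = rng.normal(0, 0.1, (H, V))      # hidden -> output

def predict(history_ids):
    """Forward pass: embed, tanh hidden layer, softmax output."""
    x = C[history_ids].reshape(-1)   # concatenate the N embeddings
    h = np.tanh(x @ W1)              # hidden layer: tanh
    logits = h @ W2
    p = np.exp(logits - logits.max())
    return p / p.sum()               # softmax over the vocabulary

p = predict([3, 17, 42, 7])          # distribution over the next word
# training: loop through the corpus, compare p to the 1-hot vector of
# the observed next word (cross-entropy), and update C, W1, W2
```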
Word Embedding
Regularities in the embedding space reflect word relationships:
– adjectives, base form vs. comparative, e.g., good / better
– nouns, singular vs. plural, e.g., year / years
– verbs, present tense vs. past tense, e.g., see / saw
Analogies can be solved with vector arithmetic:
– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities
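The analogy task reduces to nearest-neighbor search over embedding vectors. A toy sketch; the 3-dimensional vectors below are made up purely for illustration (real embeddings come out of training):

```python
import numpy as np

# Made-up toy embeddings, chosen so the analogy works out
emb = {
    "good":   np.array([0.9, 0.1, 0.0]),
    "better": np.array([0.9, 0.1, 0.7]),
    "year":   np.array([0.1, 0.8, 0.0]),
    "years":  np.array([0.1, 0.8, 0.7]),
}

def analogy(a, b, c):
    """a is to b as c is to ? -- nearest neighbor of emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in emb if w != c), key=lambda w: cos(emb[w], target))

print(analogy("good", "better", "year"))   # -> "years"
```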
Recurrent Neural Networks

[Figure: a recurrent neural language model unrolled over the input words w1, w2, w3. At each step the word is embedded (Embed), the hidden layer is updated (tanh), and a softmax predicts the output word; the hidden layer is copied forward to the next time step, so the history is carried in the hidden state.]
Training

[Figure: training the unrolled recurrent language model: at each time step t the input word w_t is embedded (E w_t), the hidden state h_t is computed (RNN), the softmax produces the prediction y_t, and a cost is computed against the correct output word; the costs from all time steps are combined.]
→ updates are computed through the unfolded recurrent neural network (back-propagation through time)
Unfolding over the full sequence is expensive, so back-propagation through time is typically truncated:
– 5 time steps seem to be sufficient
– the network still learns to store information for more than 5 time steps
Alternatively, updates can be batched:
– process 10-20 training examples
– update backwards through all examples
– removes the need for multiple steps for each training example
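A sketch of truncated back-propagation through time in PyTorch (the model sizes, the 5-step window, and the batch of random data are all placeholders): the hidden state is carried across windows but detached, so gradients only flow back a fixed number of steps.

```python
import torch
import torch.nn as nn

V, E, H = 1000, 32, 64
embed = nn.Embedding(V, E)
rnn = nn.RNN(E, H, batch_first=True)
out = nn.Linear(H, V)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

data = torch.randint(0, V, (16, 101))    # placeholder batch of 16 sequences
h = torch.zeros(1, 16, H)
for t in range(0, 100, 5):               # truncate BPTT to 5 time steps
    x, y = data[:, t:t+5], data[:, t+1:t+6]
    h = h.detach()                       # cut the gradient at the window edge
    hidden, h = rnn(embed(x), h)
    loss = loss_fn(out(hidden).reshape(-1, V), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```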
The error signal at a hidden state comes from two sources:
– the prediction at that time step
– its impact on future time steps
The hidden state serves as
– the memory of the network
– a continuous space representation used to predict output words
It has to carry information over long distances, e.g., for subject-verb agreement:
– After much economic progress over the years, the country → has
– The country which has made much economic progress over the years still → has
Long Short-Term Memory (LSTM)

The core of the LSTM is a cell that is
– similar to a node in a hidden layer
– but: has an explicit memory state
The memory state is controlled by three gates:
– input gate: how much new input changes the memory state
– forget gate: how much of the prior memory state is retained
– output gate: how strongly the memory state is passed on to the next layer
[Figure: LSTM cell with input gate and forget gate; input values X and memory m flow from the preceding layer and from the LSTM layer at time t−1 into the LSTM layer at time t, whose hidden values h feed the next layer Y]
The memory and output values at time step t are computed as

memory^t = gate_input × input^t + gate_forget × memory^{t−1}
output^t = gate_output × memory^t
h^t = f(output^t)

Given the node values of the prior layer x^t = (x^t_1, ..., x^t_X) and the values of the hidden layer from the previous time step h^{t−1} = (h^{t−1}_1, ..., h^{t−1}_H), the input value is a combination of matrix multiplications with weights w^x and w^h and an activation function g:

input^t = g( Σ_{i=1}^{X} w^x_i x^t_i + Σ_{i=1}^{H} w^h_i h^{t−1}_i )
How are the gate values set? → with a neural network layer! For each gate a ∈ {input, forget, output} there are
– a weight matrix W^{xa} to consider the node values in the previous layer x^t
– a weight matrix W^{ha} to consider the hidden layer h^{t−1} at the previous time step
– a weight matrix W^{ma} to consider the memory at the previous time step memory^{t−1}
– an activation function h:

gate_a = h( Σ_{i=1}^{X} w^{xa}_i x^t_i + Σ_{i=1}^{H} w^{ha}_i h^{t−1}_i + Σ_{i=1}^{H} w^{ma}_i memory^{t−1}_i )
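Putting the cell equations together, here is a numpy sketch of one LSTM step in the slides' notation. The parameter dictionary P, the sigmoid as the gate activation h, and all sizes are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, mem_prev, P, g=np.tanh, f=np.tanh):
    """One LSTM step. P holds, per gate a in {i, f, o}, weight matrices
    Wx<a>, Wh<a>, Wm<a> (the last looks at the previous memory), plus
    Wx, Wh for the input value."""
    def gate(a):
        return sigmoid(P["Wx" + a] @ x + P["Wh" + a] @ h_prev
                       + P["Wm" + a] @ mem_prev)
    inp = g(P["Wx"] @ x + P["Wh"] @ h_prev)       # input value
    mem = gate("i") * inp + gate("f") * mem_prev  # memory update
    out = gate("o") * mem                         # gated output
    return f(out), mem                            # h_t, memory_t

# Hypothetical sizes: X-dimensional input, H-dimensional hidden/memory
X, H = 4, 3
rng = np.random.default_rng(0)
P = {name: rng.normal(0, 0.5, (H, X if name.startswith("Wx") else H))
     for name in ["Wx", "Wh", "Wxi", "Whi", "Wmi",
                  "Wxf", "Whf", "Wmf", "Wxo", "Who", "Wmo"]}
h, mem = np.zeros(H), np.zeros(H)
h, mem = lstm_cell(rng.normal(size=X), h, mem, P)
```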
Training works as before:
– all the operations are still based on
  ∗ matrix multiplications
  ∗ differentiable activation functions
→ we can compute gradients of the objective function with respect to all parameters
→ we can compute update functions
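This is exactly what automatic differentiation exploits. As a quick check (a sketch using PyTorch's built-in LSTMCell rather than the formulation above), gradients flow to every parameter:

```python
import torch

cell = torch.nn.LSTMCell(8, 16)           # input size 8, hidden size 16
h = torch.zeros(1, 16)
c = torch.zeros(1, 16)                    # memory state
h, c = cell(torch.randn(1, 8), (h, c))    # one differentiable time step
h.sum().backward()                        # back-propagate a dummy objective
print(cell.weight_ih.grad.shape)          # every parameter has a gradient
```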
[Figure from Tran, Bisazza, Monz (2016)]

The memory state can be passed along unchanged (gate_input = 0, gate_forget = 1)
⇒ the cell can remember important features over a long time span (capture long-distance dependencies)
Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"
Gated Recurrent Units (GRU)

[Figure: GRU cell with update gate and reset gate; input values X and hidden values h flow from the preceding layer and from the GRU layer at time t−1 into the GRU layer at time t, which feeds the next layer Y]
Two gates are computed from the input and the previous state:

update_t = g(W_update · input_t + U_update · state_{t−1} + bias_update)
reset_t = g(W_reset · input_t + U_reset · state_{t−1} + bias_reset)

The input and the reset-scaled previous state are combined (similar to a traditional recurrent neural network):

combination_t = f(W · input_t + U(reset_t ∘ state_{t−1}))

The new state interpolates between the previous state and this combination, weighted by the update gate:

state_t = (1 − update_t) ∘ state_{t−1} + update_t ∘ combination_t
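These four equations translate directly into code; a numpy sketch (the weight dictionary P and the choice of sigmoid for g and tanh for f are assumptions):

```python
import numpy as np

def gru_cell(x, state_prev, P):
    """One GRU step, following the equations above."""
    g = lambda z: 1.0 / (1.0 + np.exp(-z))        # gate activation
    update = g(P["Wu"] @ x + P["Uu"] @ state_prev + P["bu"])
    reset  = g(P["Wr"] @ x + P["Ur"] @ state_prev + P["br"])
    combination = np.tanh(P["W"] @ x + P["U"] @ (reset * state_prev))
    return (1 - update) * state_prev + update * combination

X, H = 4, 3                                       # hypothetical sizes
rng = np.random.default_rng(0)
P = {"Wu": rng.normal(size=(H, X)), "Uu": rng.normal(size=(H, H)),
     "Wr": rng.normal(size=(H, X)), "Ur": rng.normal(size=(H, H)),
     "W":  rng.normal(size=(H, X)), "U":  rng.normal(size=(H, H)),
     "bu": np.zeros(H), "br": np.zeros(H)}
state = gru_cell(rng.normal(size=X), np.zeros(H), P)
```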
Deep Models

[Figure: three variants of the recurrent language model. Shallow: one hidden RNN layer between the input word embedding (E x_i) and the softmax output y_t. Deep stacked: several hidden layers h_{t,1}, h_{t,2}, h_{t,3}, each RNN layer feeding the one above it at the same time step. Deep transitional: the hidden state passes through several RNN blocks within each time step before the softmax.]
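The deep stacked variant is the easiest to sketch; in PyTorch, stacking simply means requesting more recurrent layers (all sizes below are placeholders):

```python
import torch
import torch.nn as nn

V, E, H = 1000, 32, 64
embed = nn.Embedding(V, E)
rnn = nn.RNN(E, H, num_layers=3, batch_first=True)  # 3 stacked hidden layers
out = nn.Linear(H, V)

x = torch.randint(0, V, (8, 20))     # placeholder batch of word ids
hidden, h_n = rnn(embed(x))          # h_n: one final state per layer (3, 8, H)
logits = out(hidden)                 # softmax input, from the top layer only
```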