Neural Network Language Models
Philipp Koehn 16 April 2015
N-Gram Backoff Language Model

Previously, we approximated
p(W) = p(w1, w2, ..., wn)
... by decomposing it with the chain rule

p(W) = ∏i p(wi|w1, ..., wi−1)

... and limiting the history (Markov assumption)

p(wi|w1, ..., wi−1) ≃ p(wi|wi−4, wi−3, wi−2, wi−1)
→ if counts for the full history are too sparse, we back off to p(wi|wi−3, wi−2, wi−1), p(wi|wi−2, wi−1), etc., all the way to p(wi)
– exact details of backing off get complicated ("interpolated Kneser-Ney")
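A minimal sketch of this kind of backoff, using a simple fixed backoff weight ("stupid backoff" style) rather than interpolated Kneser-Ney, whose details the slide notes are more involved; the toy corpus and the 0.4 weight are illustrative assumptions:

```python
from collections import Counter

def ngram_counts(tokens, max_order=4):
    """Count all n-grams up to max_order."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def backoff_prob(word, history, counts, alpha=0.4):
    """p(word | history), backing off to shorter histories when unseen."""
    penalty = 1.0
    history = list(history)
    while history:
        ngram = tuple(history) + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[tuple(history)]
        history = history[1:]          # drop the oldest context word
        penalty *= alpha               # pay a fixed backoff penalty
    total = sum(c for ng, c in counts.items() if len(ng) == 1)
    return penalty * counts[(word,)] / total   # unigram fallback

corpus = "the cat eats the fish and the dog eats the cat".split()
counts = ngram_counts(corpus)
print(backoff_prob("eats", ["the", "dog"], counts))   # seen trigram: 1.0
print(backoff_prob("fish", ["the", "dog"], counts))   # backs off to the unigram p(fish)
```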
⇒ We are wrestling here with
– using as much relevant evidence as possible
– pooling evidence between words
Words can be represented as 1-hot vectors:
– dog = (0,0,0,0,1,0,0,0,0,...)
– cat = (0,0,0,0,0,0,0,1,0,...)
– eat = (0,1,0,0,0,0,0,0,0,...)

These vectors get very large for real vocabularies, so we keep them manageable:
– limit to, say, 20,000 most frequent words, rest are OTHER
– place words in √n classes, so each word is represented by
  ∗ 1 class label
  ∗ 1 word-within-class label
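A minimal sketch of the first option, a 1-hot encoding over the most frequent words with everything else mapped to OTHER; the toy corpus and cutoff are illustrative:

```python
import numpy as np
from collections import Counter

def build_vocab(tokens, max_size=20000):
    """Keep the max_size most frequent words; everything else becomes OTHER."""
    freq = Counter(tokens)
    vocab = {w: i for i, (w, _) in enumerate(freq.most_common(max_size))}
    vocab["OTHER"] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.get(word, vocab["OTHER"])] = 1.0
    return vec

vocab = build_vocab("the dog eats the cat".split(), max_size=3)
print(one_hot("dog", vocab))    # e.g. [0. 1. 0. 0.]
print(one_hot("zebra", vocab))  # unseen word maps to OTHER: [0. 0. 0. 1.]
```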
One way to assign words to classes:
– sort words by frequency
– place them in order into classes
– each class has the same token count
→ very frequent words have their own class
→ rare words share a class with many other words
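A minimal sketch of such a frequency-based class assignment; the class count and the toy corpus are illustrative assumptions:

```python
from collections import Counter

def frequency_classes(tokens, num_classes=3):
    """Assign frequency-sorted words to classes with roughly equal token counts."""
    freq = Counter(tokens)
    budget = sum(freq.values()) / num_classes   # tokens per class
    word_class, current, filled = {}, 0, 0.0
    for word, count in freq.most_common():      # most frequent words first
        word_class[word] = current
        filled += count
        if filled >= budget and current < num_classes - 1:
            current += 1                         # start the next class
            filled = 0.0
    return word_class

print(frequency_classes("a a a a b b c c d e f g".split()))
# 'a' gets a class of its own; the rare words d, e, f, g share the last class
```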
Activation functions in the feed-forward neural language model:
– input→embedding: none
– embedding→hidden: tanh
– hidden→output: softmax

Training with backpropagation:
– loop through the entire corpus
– update based on the error between the predicted probabilities and the 1-hot vector for the output word
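A minimal numpy sketch of the forward pass and the error signal just described (lookup for input→embedding, tanh for embedding→hidden, softmax for hidden→output); all layer sizes and weights here are illustrative placeholders, not the lecture's values:

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H, N = 1000, 30, 100, 4          # vocab, embedding dim, hidden size, context words

C = rng.normal(0, 0.1, (V, E))         # embedding matrix: input -> embedding is a lookup
W_h = rng.normal(0, 0.1, (N * E, H))   # embedding -> hidden weights
W_o = rng.normal(0, 0.1, (H, V))       # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(context_ids):
    x = C[context_ids].reshape(-1)     # concatenate the context word embeddings
    h = np.tanh(x @ W_h)               # embedding -> hidden: tanh
    return softmax(h @ W_o)            # hidden -> output: softmax

p = forward([3, 17, 42, 7])            # predict the next word after 4 context words
target = 5                             # id of the word that actually followed
error = p.copy()
error[target] -= 1.0                   # predicted probabilities minus the 1-hot target
print(p.shape, round(p.sum(), 6), error[target])
```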
Word Embedding
Word embeddings capture morphosyntactic relations:
– adjectives base form vs. comparative, e.g., good, better
– nouns singular vs. plural, e.g., year, years
– verbs present tense vs. past tense, e.g., see, saw

They also capture semantic relations:
– clothing is to shirt as dish is to bowl
– evaluated on human judgment data of semantic similarities
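A minimal sketch of how such analogies are typically answered with embedding arithmetic and cosine similarity ("a is to b as c is to ?"); the embeddings below are random placeholders, a trained model would supply real ones:

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["clothing", "shirt", "dish", "bowl", "year", "years"]
emb = {w: rng.normal(size=30) for w in words}   # stand-in for trained embeddings

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by nearest neighbour to b - a + c."""
    query = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))

print(analogy("clothing", "shirt", "dish"))
# with trained embeddings this should return 'bowl'; with random ones it is arbitrary
```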
Integrating the neural LM into the decoder, e.g., by re-scoring:
– n-best list
– lattice
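A minimal sketch of n-best list re-scoring with the neural LM added as an extra feature; the hypotheses, baseline scores, feature weight, and the stand-in nlm_score function are all illustrative assumptions:

```python
def rerank(nbest, nlm_score, weight=0.5):
    """nbest: list of (hypothesis, baseline model score); higher scores are better."""
    rescored = [(hyp, base + weight * nlm_score(hyp)) for hyp, base in nbest]
    return max(rescored, key=lambda item: item[1])

# Toy stand-in for neural LM log-probabilities (an assumption, not a real model).
toy_nlm = {"the dog eats": -1.5, "the dog eat": -4.0}
nbest = [("the dog eat", -4.0), ("the dog eats", -4.2)]
print(rerank(nbest, nlm_score=toy_nlm.get))   # the LM feature flips the ranking
```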
[Figure: computational cost of the feed-forward LM. The 4×30 input embedding nodes can be precomputed (a table lookup per word); the 4×30×100 weights feed a 100-node hidden layer whose values can be cached; the 100×1,000,000 weights into the 1,000,000-node output layer are the expensive part, but only the score for the predicted word needs to be computed.]
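A minimal sketch of the savings indicated in the figure: the embedding lookup is precomputed table access, the hidden layer can be cached per history, and for re-scoring only the predicted word's output score is computed; sizes and weights are illustrative:

```python
import numpy as np
from functools import lru_cache

rng = np.random.default_rng(2)
V, E, H, N = 1000, 30, 100, 4
C = rng.normal(0, 0.1, (V, E))         # embedding table (precomputed lookup)
W_h = rng.normal(0, 0.1, (N * E, H))
W_o = rng.normal(0, 0.1, (H, V))

@lru_cache(maxsize=None)
def hidden(context_ids):
    """Hidden layer for a given history; cached since many n-grams share it."""
    x = C[list(context_ids)].reshape(-1)
    return tuple(np.tanh(x @ W_h))      # tuple so the result is hashable/cacheable

def score(context_ids, word_id):
    h = np.array(hidden(tuple(context_ids)))
    return h @ W_o[:, word_id]          # only the predicted word's output column

print(score([3, 17, 42, 7], 5))         # an unnormalized score, see below
```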
To turn output node values into proper probabilities, we have to
– compute scores for all possible words
– add them up
– normalize (softmax)

This normalization over the full vocabulary is expensive. Ways around it:
– we do not care: a score is a score (Auli and Gao, 2014)
– training regime that normalizes (Vaswani et al., 2013)
– integrate normalization into the objective function (Devlin et al., 2014)

Or factor the output over word classes:
– first predict the class, normalize
– then predict the word within the class, normalize
→ compute 2√n instead of n output node values
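A minimal sketch of the class-factored output layer, computing p(w|h) = p(class(w)|h) · p(w|class(w), h) with √n classes of √n words each; sizes, class assignment, and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
V, H = 10000, 100
num_classes = int(np.sqrt(V))                    # 100 classes of 100 words each
word_class = np.repeat(np.arange(num_classes), V // num_classes)

W_class = rng.normal(0, 0.1, (H, num_classes))   # hidden -> class scores
W_word = rng.normal(0, 0.1, (H, V))              # hidden -> word scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prob(h, word_id):
    c = word_class[word_id]
    p_class = softmax(h @ W_class)[c]            # normalize over ~sqrt(V) classes
    members = np.where(word_class == c)[0]       # words in this class
    p_word = softmax(h @ W_word[:, members])     # normalize over ~sqrt(V) words
    return p_class * p_word[np.where(members == word_id)[0][0]]

h = rng.normal(size=H)
print(prob(h, 4242))
```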
[Figures: recurrent neural network unrolled over time; the hidden layer values are copied over as input to the next time step.]
Backpropagation through time: updates are propagated back through the unfolded recurrent neural network
How far back do we have to unfold?
– unfolding for 5 time steps seems to be sufficient (see the sketch below)
– the network nevertheless learns to store information for more than 5 time steps

Mini batches:
– process 10-20 training examples
– update backwards through all examples
– removes the need for multiple update steps for each training example
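A minimal sketch of truncated backpropagation through time for a plain tanh recurrent layer, pushing the error back over a fixed window of 5 time steps; dimensions, the toy inputs, and the incoming error signal are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
E, H = 8, 16
W = rng.normal(0, 0.1, (H, E))     # input -> hidden
U = rng.normal(0, 0.1, (H, H))     # hidden -> hidden (recurrent)

def bptt(xs, d_h_last, steps=5):
    """Run forward over all inputs, then backprop d_h_last through the last `steps`."""
    hs = [np.zeros(H)]
    for x in xs:                                   # forward pass, keeping the history
        hs.append(np.tanh(W @ x + U @ hs[-1]))
    dW, dU, d_h = np.zeros_like(W), np.zeros_like(U), d_h_last
    for t in range(len(xs), max(len(xs) - steps, 0), -1):
        d_pre = d_h * (1 - hs[t] ** 2)             # back through the tanh
        dW += np.outer(d_pre, xs[t - 1])
        dU += np.outer(d_pre, hs[t - 1])
        d_h = U.T @ d_pre                          # pass the error one step back
    return dW, dU

xs = [rng.normal(size=E) for _ in range(10)]
dW, dU = bptt(xs, d_h_last=rng.normal(size=H))
print(dW.shape, dU.shape)
```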
The hidden state depends on the entire history, so hypotheses with different histories cannot be recombined
⇒ very bad for dynamic programming
The hidden layer values matter for both
– the prediction at that time step
– the impact on future time steps
⇒ We want the proper balance between the two
Each hidden node i combines
– the input word embedding x
– the previous hidden layer values h^(t−1)
– weight matrices W and U
into the weighted sum

Σ_j w_ij x_j + Σ_j u_ij h^(t−1)_j
The LSTM extends this with an input gate, an output gate, and a forget gate, each with their own weight matrices: W^I, U^I, W^O, U^O, W^F, U^F
input gate:      y^I_i = sigmoid( Σ_j w^I_ij x_j + Σ_j u^I_ij h^(t−1)_j )

forget gate:     y^F_i = sigmoid( Σ_j w^F_ij x_j + Σ_j u^F_ij h^(t−1)_j )

candidate cell value:  C̃^(t)_i = tanh( Σ_j w^C_ij x_j + Σ_j u^C_ij h^(t−1)_j )

cell value:      C^(t)_i = y^I_i · C̃^(t)_i + y^F_i · C^(t−1)_i

output gate:     y^O_i = sigmoid( Σ_j w^O_ij x_j + Σ_j u^O_ij h^(t−1)_j + Σ_j v_ij C^(t)_j )

hidden node value:  h^(t)_i = y^O_i · tanh(C^(t)_i)