Neural Language Models
CMSC 723 / LING 723 / INST 725 MARINE CARPUAT
marine@cs.umd.edu
With slides from Graham Neubig and Philipp Koehn
Roadmap:
– Modeling sequences
– First example: language model
– What are n-gram models?
– How to …
– A language model should assign higher probability to “real” or “frequently observed” sentences
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

Possible continuations for the first blank, with estimated probabilities:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
…
fried rice 0.0001
…
and 1e-100
Perplexity is the inverse probability of the test set, normalized by the number of words. The probability of the test set is computed with the chain rule; for bigrams, each word is conditioned only on the previous word. Minimizing perplexity is the same as maximizing probability.
The best language model is one that best predicts an unseen test set
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$
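To make the computation concrete, here is a minimal Python sketch (not from the slides) that scores a test sequence under a bigram model; `bigram_prob` is a hypothetical stand-in for whatever estimator the model uses:

```python
import math

def perplexity(test_words, bigram_prob):
    """Perplexity of a word sequence under a bigram model.

    bigram_prob(prev, word) is assumed to return P(word | prev).
    """
    log_prob = 0.0
    for prev, word in zip(test_words, test_words[1:]):
        log_prob += math.log(bigram_prob(prev, word))
    n = len(test_words) - 1          # number of predicted words
    # PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space to avoid underflow
    return math.exp(-log_prob / n)
```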
Aside
$$y = \mathrm{sign}\big(\mathbf{w} \cdot \phi(\mathbf{x})\big) = \mathrm{sign}\Big(\sum_{j=1}^{J} w_j \, \phi_j(\mathbf{x})\Big)$$
Example: unigram-count features for the sentence “A site, located in Maizuru, Kyoto”:
φ“A” = 1, φ“site” = 1, φ“,” = 2, φ“located” = 1, φ“in” = 1, φ“Maizuru” = 1, φ“Kyoto” = 1, φ“priest” = 0, φ“black” = 0
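As an illustration (not from the slides), a small Python sketch of computing these unigram-count features and the sign(w · φ(x)) prediction; the weights used in the example call are made up:

```python
from collections import Counter

def extract_features(tokens):
    # phi(x): unigram counts, e.g. phi[","] = 2 for the example sentence
    return Counter(tokens)

def predict(w, phi):
    # y = sign(w . phi(x)); ties are broken toward +1 here
    score = sum(w.get(feat, 0.0) * value for feat, value in phi.items())
    return 1 if score >= 0 else -1

phi = extract_features("A site , located in Maizuru , Kyoto".split())
print(phi[","])                       # 2
print(predict({"Kyoto": 1.0}, phi))   # 1
```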
[Figure: example configurations of O and X points, illustrating which ones a linear classifier can and cannot separate]
[Figure: four points in the original feature space φ0, labeled X and O in a configuration that no linear classifier can separate]
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}
[Figure: a hidden layer of two sign units; each hidden value φ1[j] is computed from the input features φ0[0], φ0[1] and a bias, with its own weights]

$$\phi_1[0] = \mathrm{sign}\big(\mathbf{w}_{0,0} \cdot \phi_0(\mathbf{x}) + b_{0,0}\big) \qquad \phi_1[1] = \mathrm{sign}\big(\mathbf{w}_{0,1} \cdot \phi_0(\mathbf{x}) + b_{0,1}\big)$$
[Figure: the same four points mapped into the new feature space φ1 by the hidden layer; in this space the two classes become linearly separable]
φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}
[Figure: the complete network, from the input features φ0[0], φ0[1] through the hidden layer φ1[0], φ1[1] to the output φ2[0] = y; for training, the step function is replaced by a smooth non-linearity such as tanh]

$$\phi_1[j] = \tanh\big(\mathbf{w}_{0,j} \cdot \phi_0(\mathbf{x}) + b_{0,j}\big) \qquad y = \phi_2[0] = \tanh\big(\mathbf{w}_{1,0} \cdot \phi_1(\mathbf{x}) + b_{1,0}\big)$$
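To check that such a two-layer network really separates these points, here is a minimal numpy sketch. The weight values are assumptions chosen to reproduce the transformed coordinates φ1(xi) listed above, not values read off the original figure, and sign units are used so the outputs are exactly ±1 (tanh gives the same signs with these weights):

```python
import numpy as np

# The four points phi_0(x) and their labels (one class = +1, the other = -1)
X = np.array([[-1, 1], [1, 1], [-1, -1], [1, -1]])   # x1, x2, x3, x4
y = np.array([-1, 1, 1, -1])                         # not linearly separable

# Hidden layer (assumed weights): an AND-like unit and a NOR-like unit
W0 = np.array([[1.0, 1.0], [-1.0, -1.0]])
b0 = np.array([-1.0, -1.0])
# Output layer: an OR over the two hidden units
w1 = np.array([1.0, 1.0])
b1 = 1.0

phi1 = np.sign(X @ W0.T + b0)       # hidden representation phi_1(x)
y_hat = np.sign(phi1 @ w1 + b1)     # output phi_2[0]
print(phi1)    # x1 -> (-1,-1), x2 -> (1,-1), x3 -> (-1,1), x4 -> (-1,-1)
print(y_hat)   # [-1.  1.  1. -1.]  -- matches y
```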
For probabilistic multi-class prediction, the scores are normalized with the softmax function:

$$P(y \mid \mathbf{x}) = \frac{e^{\mathbf{w} \cdot \phi(\mathbf{x}, y)}}{\sum_{\tilde{y}} e^{\mathbf{w} \cdot \phi(\mathbf{x}, \tilde{y})}}$$

(numerator: score of the current class; denominator: sum over all possible classes)
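A minimal numpy sketch of this normalization (with the usual max-subtraction for numerical stability); the input is assumed to be a vector containing one score w · φ(x, y) per class:

```python
import numpy as np

def softmax(scores):
    # P(y | x) = exp(score_y) / sum over all classes of exp(score)
    scores = np.asarray(scores, dtype=float)
    scores = scores - scores.max()      # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

print(softmax([2.0, 1.0, 0.1]))         # ~[0.66, 0.24, 0.10]
```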
Online training algorithm for probabilistic models:

    w = 0
    for I iterations:
        for each labeled pair (x, y) in the data:
            w += α * dP(y|x)/dw

In other words, each update moves w in the direction that will increase the probability of y.
[Plot: dP(y|x)/dw · φ(x) as a function of w · φ(x)]
$$\frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}} = \frac{d}{d\mathbf{w}} \, \frac{e^{\mathbf{w} \cdot \phi(\mathbf{x})}}{1 + e^{\mathbf{w} \cdot \phi(\mathbf{x})}} = \phi(\mathbf{x}) \, \frac{e^{\mathbf{w} \cdot \phi(\mathbf{x})}}{\big(1 + e^{\mathbf{w} \cdot \phi(\mathbf{x})}\big)^2}$$

$$\frac{dP(y=-1 \mid \mathbf{x})}{d\mathbf{w}} = \frac{d}{d\mathbf{w}} \left(1 - \frac{e^{\mathbf{w} \cdot \phi(\mathbf{x})}}{1 + e^{\mathbf{w} \cdot \phi(\mathbf{x})}}\right) = -\phi(\mathbf{x}) \, \frac{e^{\mathbf{w} \cdot \phi(\mathbf{x})}}{\big(1 + e^{\mathbf{w} \cdot \phi(\mathbf{x})}\big)^2}$$
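Putting the update rule and these gradients together, a minimal Python sketch of online training for a single logistic unit (not the slides' code); feature vectors are dicts, and the learning rate and iteration count are illustrative:

```python
import math

def train_logistic(data, iterations=10, alpha=0.1):
    """data: list of (phi, y) pairs, where phi is a dict of feature values
    and y is +1 or -1."""
    w = {}
    for _ in range(iterations):
        for phi, y in data:
            score = sum(w.get(f, 0.0) * v for f, v in phi.items())
            p = math.exp(score) / (1.0 + math.exp(score))     # P(y=1 | x)
            # dP(y|x)/dw = +/- phi(x) * e^(w.phi) / (1 + e^(w.phi))^2
            grad = p * (1.0 - p) * (1.0 if y == 1 else -1.0)
            for f, v in phi.items():
                w[f] = w.get(f, 0.0) + alpha * grad * v       # w += alpha * dP(y|x)/dw
    return w
```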
For a multi-layer network, the gradient of the output unit (weights $\mathbf{w}_4$, which takes the hidden outputs $\mathbf{h}(\mathbf{x})$ as input) has the same form as before:

$$\frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}_4} = \mathbf{h}(\mathbf{x}) \, \frac{e^{\mathbf{w}_4 \cdot \mathbf{h}(\mathbf{x})}}{\big(1 + e^{\mathbf{w}_4 \cdot \mathbf{h}(\mathbf{x})}\big)^2}$$

But the gradients for the hidden units $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3$ are not immediate:

$$\frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}_1} = ? \qquad \frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}_2} = ? \qquad \frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}_3} = ?$$

They are obtained with the chain rule, e.g.

$$\frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}_1} = \frac{dP(y=1 \mid \mathbf{x})}{d\big(\mathbf{w}_4 \cdot \mathbf{h}(\mathbf{x})\big)} \; \frac{d\big(\mathbf{w}_4 \cdot \mathbf{h}(\mathbf{x})\big)}{dh_1(\mathbf{x})} \; \frac{dh_1(\mathbf{x})}{d\mathbf{w}_1} = \frac{e^{\mathbf{w}_4 \cdot \mathbf{h}(\mathbf{x})}}{\big(1 + e^{\mathbf{w}_4 \cdot \mathbf{h}(\mathbf{x})}\big)^2} \; w_{1,4} \; \frac{dh_1(\mathbf{x})}{d\mathbf{w}_1}$$
Each factor combines the error of the next unit ($\delta_k$), the connecting weight ($w_{j,k}$), and the gradient of this unit. Writing $\delta_j$ for the error at unit $j$ and $h_j'$ for the derivative of its non-linearity:

$$\delta_j = h_j' \sum_k \delta_k \, w_{j,k} \qquad \frac{dP(y=1 \mid \mathbf{x})}{d\mathbf{w}_j} = \delta_j \, \boldsymbol{\phi}_j(\mathbf{x})$$

where $\boldsymbol{\phi}_j(\mathbf{x})$ is the input vector of unit $j$ (the original features for hidden units, the hidden outputs $\mathbf{h}(\mathbf{x})$ for the output unit). The $\delta$ values are computed from the output layer back toward the input: back-propagation.
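A minimal numpy sketch of one forward and backward pass for a network with a tanh hidden layer and a logistic output unit, following the δ recursion above; the parameter names and shapes are illustrative:

```python
import numpy as np

def forward_backward(x, W0, b0, w1, b1):
    # Forward pass: hidden layer h = tanh(W0 x + b0), output p = P(y=1|x)
    h = np.tanh(W0 @ x + b0)
    p = 1.0 / (1.0 + np.exp(-(w1 @ h + b1)))

    # Backward pass: gradients of P(y=1|x) with respect to all parameters
    delta_out = p * (1.0 - p)                          # error at the output unit
    grad_w1 = delta_out * h                            # dP/dw1 = delta_out * (input to output unit)
    grad_b1 = delta_out
    delta_hidden = (1.0 - h ** 2) * (delta_out * w1)   # delta_j = h_j' * sum_k delta_k * w_{j,k}
    grad_W0 = np.outer(delta_hidden, x)                # dP/dW0[j] = delta_j * (input to hidden unit j)
    grad_b0 = delta_hidden
    return p, grad_W0, grad_b0, grad_w1, grad_b1
```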
In general: starting from the input features φ(x), calculate each layer's values from the previous layer's, forward through the network up to the output y
– Each step: vector/matrix operations + non-linearities
For more details, see Cho chap 3 or CIML Chap 7
Aside
Words as one-hot vectors:
dog = [ 0, 0, 0, 0, 1, 0, 0, 0, … ]
cat = [ 0, 0, 0, 0, 0, 0, 1, 0, … ]
eat = [ 0, 1, 0, 0, 0, 0, 0, 0, … ]
These vectors are huge for realistic vocabularies; common remedies:
– limit the vocabulary to the most frequent words (e.g., top 20000)
– cluster words into classes
Word embeddings: map each word into a lower-dimensional real-valued space using a shared weight matrix C (the embedding layer) [Bengio et al. 2003]
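A minimal numpy sketch (with illustrative sizes) showing that multiplying a one-hot vector by the shared matrix C simply selects one row of C, which is why the embedding layer is implemented as a table lookup:

```python
import numpy as np

V, d = 8, 3                     # vocabulary size and embedding dimension (illustrative)
C = np.random.randn(V, d)       # shared embedding matrix: one row per word

word_id = 4                     # e.g. the index of "dog" in the vocabulary
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# one_hot @ C picks out row word_id of C; in practice we index directly
assert np.allclose(one_hot @ C, C[word_id])
```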
Word embeddings are a by-product of training: words that occur in similar contexts end up with similar embeddings [Turian et al. 2009]
[Figure: neural language model architecture, predicting the next word y from the embeddings of the preceding words]
After processing a few training examples, update the weights by propagating the error back through the unfolded recurrent neural network (back-propagation through time).
The hidden state serves as:
– the memory of the network
– a continuous-space representation used to predict output words
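A minimal numpy sketch of one step of a simple (Elman-style) recurrent language model, with illustrative parameter names; the hidden state h carries the network's memory, and a softmax over the vocabulary gives the next-word distribution:

```python
import numpy as np

def rnn_lm_step(word_id, h_prev, C, W_x, W_h, W_out):
    """One time step of a simple recurrent language model."""
    x = C[word_id]                          # embedding of the current word
    h = np.tanh(W_x @ x + W_h @ h_prev)     # new hidden state: the network's memory
    scores = W_out @ h                      # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                    # softmax: P(next word | history so far)
    return h, probs
```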
Gated variants of the recurrent unit:
– Long Short-Term Memory (LSTM)
– Gated Recurrent Units (GRU)
Neural language models have been applied to:
– speech recognition (Mikolov et al. 2011)
– and more recently machine translation (Devlin et al. 2014)
– The softmax requires normalizing over the sum of scores for all possible words
– What to do?