A fast and simple algorithm for training neural probabilistic language models
Andriy Mnih Joint work with Yee Whye Teh
Gatsby Computational Neuroscience Unit University College London
25 January 2013
1 / 22
Statistical language modelling
◮ speech recognition
◮ machine translation
◮ information retrieval
◮ The distribution of the next word depends on only a fixed number of preceding words (the context).
◮ Though false, this assumption makes the task much more tractable without making it trivial.
2 / 22
◮ Estimated by counting the number of occurrences of each word in each context in the training data.
◮ Smoothing is essential for good performance.
◮ The number of model parameters is exponential in the context size.
◮ Cannot take advantage of large contexts.
3 / 22
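As a concrete illustration of count-based estimation with smoothing, here is a minimal add-one-smoothed bigram model in Python (the corpus, the smoothing scheme, and all names are illustrative, not taken from the talk):

    from collections import defaultdict

    def train_bigram(tokens):
        # Count how often each word follows each one-word context.
        counts = defaultdict(lambda: defaultdict(int))
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
        return counts

    def prob(counts, vocab_size, prev, word, alpha=1.0):
        # Add-alpha smoothing: unseen bigrams still get non-zero probability.
        context = counts[prev]
        return (context[word] + alpha) / (sum(context.values()) + alpha * vocab_size)

    tokens = "the cat sat on the mat".split()
    counts = train_bigram(tokens)
    print(prob(counts, vocab_size=6, prev="the", word="cat"))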
◮ Words are represented with real-valued feature vectors learned from data.
◮ A neural network maps a context (a sequence of word feature vectors) to a distribution over the next word.
◮ Word feature vectors and neural net parameters are learned jointly.
4 / 22
[Figure: 2D visualization of learned word feature vectors; prepositions and common verbs form distinct clusters.]
5 / 22
◮ The model defines the distribution over the next word w given a context h as
P^h_θ(w) = exp(s_θ(w, h)) / Z_θ(h),
where Z_θ(h) = Σ_{w′} exp(s_θ(w′, h)) is the normalizer for context h.
◮ The predicted representation of the next word is q̂ = Σ_{i=1}^{n−1} C_i q_{w_i}.
◮ The scoring function is s_θ(w, h) = q̂⊤ q_w + b_w.
6 / 22
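As a rough sketch of how these quantities can be computed (assuming the log-bilinear parameterization above; the array names R, C, and b are illustrative):

    import numpy as np

    def next_word_distribution(context_ids, R, C, b):
        # R: (vocab, dim) word feature vectors; C: one (dim, dim) matrix per context position; b: (vocab,) biases.
        q_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))  # predicted representation q_hat
        scores = R @ q_hat + b                                       # s_theta(w, h) for every word w
        scores -= scores.max()                                       # subtract max for numerical stability
        unnorm = np.exp(scores)
        return unnorm / unnorm.sum()                                 # dividing by Z_theta(h) costs O(vocab)

    # Toy example: 5-word vocabulary, 3-dimensional features, 2-word context.
    rng = np.random.default_rng(0)
    R = rng.normal(size=(5, 3))
    C = [rng.normal(size=(3, 3)) for _ in range(2)]
    b = np.zeros(5)
    print(next_word_distribution([1, 4], R, C, b))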
◮ The maximum-likelihood gradient is
∂/∂θ log P^h_θ(w) = ∂/∂θ s_θ(w, h) − Σ_{w′} P^h_θ(w′) ∂/∂θ s_θ(w′, h).
◮ The second term, ∂/∂θ log Z_θ(h), is expensive: the time complexity is linear in the vocabulary size.
◮ Importance sampling: sample words from a proposal distribution Q_h(x) and reweight the resulting gradient contributions:
Σ_{w′} P^h_θ(w′) ∂/∂θ s_θ(w′, h) ≈ (1/V) Σ_{j=1}^{k} v(x_j) ∂/∂θ s_θ(x_j, h),
where v(x) = exp(s_θ(x, h)) / Q_h(x) and V = Σ_{j=1}^{k} v(x_j).
◮ Stability issues: need either a lot of samples or an adaptive proposal distribution.
7 / 22
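A toy numerical sketch of this importance-sampling estimate (the scores below just stand in for the per-word gradient contributions, and the uniform proposal is purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = 10_000
    scores = rng.normal(size=vocab)            # stand-in for s_theta(w, h) over the whole vocabulary
    probs = np.exp(scores) / np.exp(scores).sum()

    # Exact model expectation needed by the gradient's second term: O(vocab) per context.
    exact = (probs * scores).sum()

    # Importance-sampling estimate with k samples from a proposal Q_h.
    k = 100
    proposal = np.full(vocab, 1.0 / vocab)
    samples = rng.choice(vocab, size=k, p=proposal)
    v = np.exp(scores[samples]) / proposal[samples]   # importance weights v(x_j)
    estimate = (v * scores[samples]).sum() / v.sum()  # self-normalized estimate

    # The estimate is noisy: a few large weights can dominate it, hence the stability issues.
    print(exact, estimate)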
8 / 22
◮ NCE can handle unnormalized models P^u_θ(x):
◮ Set P_θ(x) = P^u_θ(x)/Z and learn Z (or log Z).
◮ The weights multiplying the ∂/∂θ log P_θ(x) terms in the NCE gradient are always between 0 and 1.
◮ Can use far fewer noise samples as a result.
9 / 22
◮ This is a pointwise reweighting of the ML gradient.
10 / 22
◮ One distribution per context.
◮ These distributions share parameters.
◮ If P^h_θ(w) is the probability of word w in context h under the model, the NCE gradient for context h is
∂/∂θ J^h(θ) = E_{w∼P^h_d}[ kP_n(w) / (P^h_θ(w) + kP_n(w)) · ∂/∂θ log P^h_θ(w) ]
− k E_{x∼P_n}[ P^h_θ(x) / (P^h_θ(x) + kP_n(x)) · ∂/∂θ log P^h_θ(x) ],
where P^h_d is the data distribution and P_n is the noise distribution.
11 / 22
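A minimal sketch of the per-datapoint NCE objective in Python, treating the scores as log-probabilities with the normalizer absorbed (all names are illustrative, and the uniform noise distribution here stands in for the unigram noise distribution typically used):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nce_objective(scores, log_noise, target_id, noise_ids, k):
        # scores[w] plays the role of log P_theta^h(w); sigmoid(delta) = P_theta / (P_theta + k P_n).
        delta = lambda ids: scores[ids] - (np.log(k) + log_noise[ids])
        pos = np.log(sigmoid(delta(target_id)))               # the observed word, classified as data
        neg = np.log(1.0 - sigmoid(delta(noise_ids))).sum()   # k noise samples, classified as noise
        return pos + neg  # maximize; the gradient touches only k + 1 words, not the whole vocabulary

    # Toy example: 10K-word vocabulary, k = 25 noise samples per datapoint.
    rng = np.random.default_rng(0)
    vocab, k = 10_000, 25
    scores = rng.normal(size=vocab)
    noise = np.full(vocab, 1.0 / vocab)
    target_id = 42
    noise_ids = rng.choice(vocab, size=k, p=noise)
    print(nce_objective(scores, np.log(noise), target_id, noise_ids, k))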
◮ The NCE update with k noise samples is roughly (cd + vd)/(cd + kd) times faster than the ML update.
◮ c is the context size
◮ d is the representation dimensionality
◮ v is the vocabulary size
◮ k is the number of noise samples
◮ This ratio simplifies to (c + v)/(c + k).
12 / 22
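For a rough sense of scale under the expression above (my own illustrative numbers, not figures quoted in the talk): with v = 10,000 words, k = 25 noise samples, and a context of c = 2 words, (c + v)/(c + k) ≈ 10,002/27 ≈ 370, so the per-update cost becomes essentially independent of the vocabulary size.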
◮ For large context sizes and datasets the number of such normalizing constants becomes impractical to learn and store.
◮ Fortunately, learning works just as well if the normalizing constants are simply fixed to 1.
◮ Use several noise samples per datapoint.
◮ Generate new noise samples before each parameter update.
13 / 22
◮ Training set: 930K words
◮ Validation set: 74K words
◮ Test set: 82K words
◮ Vocabulary: 10K words
◮ Perplexity is the geometric average of 1/P(w|h) over the evaluation set.
◮ The perplexity of a uniform distribution over N values is N.
14 / 22
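A small sketch of this definition (the function name is illustrative):

    import numpy as np

    def perplexity(word_probs):
        # Geometric mean of 1/P(w|h), computed in log space for numerical stability.
        word_probs = np.asarray(word_probs)
        return np.exp(-np.mean(np.log(word_probs)))

    # A uniform distribution over N = 4 values has perplexity 4.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0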
15 / 22
16 / 22
◮ Task: pick which of five candidate words correctly completes each sentence.
◮ Test set: 1,040 sentences from five Sherlock Holmes novels
◮ Training data: 522 19th-century novels from Project Gutenberg (48M words)
◮ Random guessing gives 20% accuracy.
17 / 22
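One straightforward way to apply a language model to this task, sketched below under the assumption that the model exposes P(w | h) (not necessarily the exact scoring rule used in the talk): fill each candidate into the blank, score the resulting sentence, and pick the highest-scoring candidate.

    import numpy as np

    def sentence_log_prob(words, prob_fn, context_size=2):
        # Sum of log P(w | h), using the preceding context_size words as the context h.
        total = 0.0
        for i, w in enumerate(words):
            context = tuple(words[max(0, i - context_size):i])
            total += np.log(prob_fn(w, context))
        return total

    def complete(template, candidates, prob_fn):
        # template contains None where the missing word goes.
        filled = lambda c: [c if w is None else w for w in template]
        return max(candidates, key=lambda c: sentence_log_prob(filled(c), prob_fn))

    # Toy usage with a dummy uniform model; a real run would plug in the trained language model.
    dummy = lambda w, h: 1.0 / 10_000
    print(complete(["the", None, "barked"], ["dog", "cat", "idea"], dummy))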
18 / 22
19 / 22
◮ Diagonal context matrices for better scalability w.r.t. word representation dimensionality.
◮ Separate representation tables for context words and the next word.
◮ Use a special “out-of-sentence” token for words in context positions outside the sentence.
◮ Estimated ML training time: 1-2 months.
20 / 22
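A small illustration (mine, not from the talk) of why diagonal context matrices scale better: a full d×d context matrix costs O(d²) parameters and time per context word, while a diagonal one stored as a length-d vector costs O(d).

    import numpy as np

    d = 256
    r = np.random.default_rng(0).normal(size=d)   # feature vector of one context word

    C_full = np.eye(d)        # full context matrix: d*d parameters, O(d^2) per multiply
    c_diag = np.ones(d)       # diagonal context matrix stored as a vector: d parameters, O(d)

    q_full = C_full @ r       # contribution to the predicted representation q_hat
    q_diag = c_diag * r       # elementwise product; identical result when C is diagonal

    assert np.allclose(q_full, q_diag)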
21 / 22
◮ Over an order of magnitude faster than maximum-likelihood training.
◮ Very stable even when using one noise sample per datapoint.
◮ Models trained using NCE with 25 noise samples per datapoint match the performance of ML-trained models.
22 / 22