CS11-747 Neural Networks for NLP
Why is word2vec so fast? Efficiency tricks for neural nets
Taylor Berg-Kirkpatrick
Site https://phontron.com/class/nn4nlp2017/
Glamorous Life of an AI Scientist
[Image: "Perception" vs. "Reality" of an AI scientist's life; in reality, mostly waiting for models to train. Photo Credit: Antoine Miech @ Twitter]
Softmax over Large Vocabularies
Predicting the next word means taking an exponent for every vocabulary item and normalizing (the softmax). This is expensive, and we would like to approximate it.
The key idea behind the tricks that follow: rather than normalizing over the whole vocabulary, compare the correct value against a small number of negative samples.
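To see where the cost comes from, here is a minimal numpy sketch (the sizes and variable names are my own placeholders, not from the lecture): every single prediction touches all V rows of the output matrix.

import numpy as np

# Hypothetical sizes: V-word vocabulary, H-dimensional hidden state.
V, H = 100_000, 512
W = np.random.randn(V, H) * 0.01   # output word representations
h = np.random.randn(H)             # hidden state encoding the context

s = W @ h                          # a score s(x|h) for EVERY word: O(V*H)
p = np.exp(s - s.max())            # exponent for every word...
p /= p.sum()                       # ...normalized over the whole vocabulary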
Importance Sampling
The partition function
$$Z(h_i) = \sum_{\tilde{x}_i} e^{s(\tilde{x}_i \mid h_i)}$$
cannot be calculated exactly over a large vocabulary. Instead, sample from a proposal distribution $Q$ (uniform or unigram), then re-weight each sample with $e^s / Q$ to approximate the denominator:
$$Z(h_i) \approx \frac{1}{N} \sum_{\tilde{x}_i \sim Q(\cdot \mid h_i)} \frac{e^{s(\tilde{x}_i \mid h_i)}}{Q(\tilde{x}_i \mid h_i)}$$
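A minimal numpy sketch of this estimator (the sizes and the uniform proposal are placeholder assumptions):

import numpy as np

rng = np.random.default_rng(0)
V, H, N = 100_000, 512, 50
W = rng.normal(scale=0.01, size=(V, H))
h = rng.normal(size=H)

Q = np.full(V, 1.0 / V)                         # proposal; a fitted unigram also works
samples = rng.choice(V, size=N, p=Q)            # draw x̃ ~ Q(·|h)
weights = np.exp(W[samples] @ h) / Q[samples]   # e^{s(x̃|h)} / Q(x̃|h)
Z_hat = weights.mean()                          # (1/N) Σ e^s/Q ≈ Z(h)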
Noise Contrastive Estimation
Train a binary classifier to distinguish true words from noise words drawn from $Q$:
$$P(d = 1 \mid x_i, h_i) = \frac{P(x_i \mid h_i)}{P(x_i \mid h_i) + N \cdot Q(x_i \mid h_i)}$$
and maximize the objective
$$\mathbb{E}_P[\log P(d = 1 \mid x_i, h_i)] + N \cdot \mathbb{E}_Q[\log P(d = 0 \mid x_i, h_i)]$$
The model is self-normalized, $\tilde{P}(x_i \mid h_i) = P(x_i \mid h_i) / e^{c_{h_i}}$, and the per-context constant $c_{h_i}$ is simply set to 0, so the expensive normalization never has to be computed.
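A sketch of the resulting objective for one training word, with placeholder scores and noise probabilities of my own; per the trick above, the unnormalized score $e^s$ (with $c_{h_i} = 0$) stands in directly for $P(x \mid h)$:

import numpy as np

def p_d1(p_model, q_noise, N):
    # P(d=1 | x, h) = P(x|h) / (P(x|h) + N * Q(x|h))
    return p_model / (p_model + N * q_noise)

# e^{s(x|h)} plays the role of P(x|h); all numbers are placeholders.
p_true, q_true = np.exp(2.1), 1e-5              # the observed word
p_noise = np.exp(np.array([-0.3, 0.1, -1.2]))   # N = 3 noise words
q_noise = np.array([1e-4, 5e-5, 2e-4])
N = len(p_noise)

# Maximize log P(d=1|true) + Σ log P(d=0|noise), i.e. minimize:
loss = -(np.log(p_d1(p_true, q_true, N))
         + np.log(1 - p_d1(p_noise, q_noise, N)).sum())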
Negative Sampling (word2vec)
Sample negative examples and calculate the log probabilities as in NCE, but treat the proposal as uniform and replace the $N \cdot Q$ term with 1:
$$P(d = 1 \mid x_i, h_i) = \frac{P(x_i \mid h_i)}{P(x_i \mid h_i) + 1}$$
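Substituting $e^s$ for $P(x \mid h)$, this collapses to the logistic sigmoid, $e^s / (e^s + 1) = \sigma(s)$, giving the familiar word2vec loss. A sketch with placeholder scores:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# P(d=1|x,h) = sigmoid(s), and P(d=0|x,h) = 1 - sigmoid(s) = sigmoid(-s).
s_true = 2.1                          # score of the observed word
s_neg = np.array([-0.3, 0.1, -1.2])   # scores of the sampled negatives

loss = -(np.log(sigmoid(s_true)) + np.log(sigmoid(-s_neg)).sum())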
Mini-batch Based Negative Sampling
Sampling negatives separately for every example is expensive, especially on the GPU, so draw one shared set of negative samples for each minibatch.
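A sketch of the trick (sizes are made up): one sampling call per minibatch instead of one per example, which also lets all negative scores come out of a single matrix multiply.

import numpy as np

rng = np.random.default_rng(0)
V, H, B, N = 100_000, 512, 32, 50        # vocab, hidden, batch, negatives
E = rng.normal(scale=0.01, size=(V, H))  # word embeddings
batch = rng.choice(V, size=B)            # the words in this minibatch

negatives = rng.choice(V, size=N)        # ONE shared set for the whole batch
neg_scores = E[batch] @ E[negatives].T   # all (B, N) negative scores at once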
Structure-Based Approximations
Change the structure of the prediction itself so the probability can be more efficiently calculated:
Hierarchical softmax: arrange the vocabulary in a binary tree and make a binary decision at every node; a word's probability is the product of the decisions on its path.
Binary code prediction: treat each word ID as a binary code (e.g. 0 1 1 1 0 → word 14) and predict every bit independently, as in the sketch below.
Refinements: a hybrid model (a softmax combined with binary codes) and error correcting codes (redundant bits that make the prediction robust to individual bit errors).
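A sketch of binary code prediction under the simplest possible code assignment (the word's ID written in binary; real systems would use a tree, frequency-based codes, or the error correcting codes above):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

V, H = 32, 512                       # placeholder sizes
n_bits = int(np.ceil(np.log2(V)))    # 5 binary decisions instead of a 32-way softmax
W_bits = np.random.randn(n_bits, H) * 0.01
h = np.random.randn(H)

bit_probs = sigmoid(W_bits @ h)      # P(bit_j = 1 | h) for each bit
code = [0, 1, 1, 1, 0]               # 01110 in binary -> word 14
p_word = np.prod([p if b else 1.0 - p
                  for p, b in zip(bit_probs, code)])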
Parallelism in Computation Graphs
Model parallelism: split the operations of a single network (the tanh, σ, and matrix-multiply nodes) across Threads 1 through 4. If the pieces sit on different GPU devices, how do we minimize memory movement?
Data parallelism: give each thread its own training examples (Thread 1: "this is an example", Thread 2: "this is another example", Thread 3: "this is the best example", Thread 4: "no, i’m the best example"). The question becomes: how do we keep parameters fresh across machines? One common answer is sketched below.
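One common answer is asynchronous (Hogwild-style) updating, sketched here under assumptions of my own (a toy linear model, four threads, no locking): each thread reads the freshest shared parameters available and writes its update straight back, accepting occasionally stale reads.

import threading
import numpy as np

params = np.zeros(10)                     # shared parameters, no lock

def worker(shard, lr=0.01):
    for x, y in shard:
        grad = 2 * (params @ x - y) * x   # gradient on this thread's example
        params[:] -= lr * grad            # unsynchronized in-place update

rng = np.random.default_rng(0)
shards = [[(rng.normal(size=10), rng.normal()) for _ in range(100)]
          for _ in range(4)]              # one data shard per thread
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()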
CPU vs. GPU
CPU, like a motorcycle: quick to start, top speed not shabby.
GPU, like an airplane: takes forever to get off the ground, but super-fast once moving.
Image Credit: Wikipedia
The CPU still wins for NLP analysis tasks with small or complicated data/networks, and its quick startup lets you run many more experiments.
Don't repeat operations whose result never changes inside a loop; hoist them out.

Bad:
for x in words_in_sentence:
    vals.append(W * c + x)

Good:
W_c = W * c
for x in words_in_sentence:
    vals.append(W_c + x)
Can you combine multiple matrix-vector multiplies into a single matrix-matrix multiply? Do so!

Bad:
for x in words_in_sentence:
    vals.append(W * x)
val = dy.concatenate(vals)

Good:
X = dy.concatenate_cols(words_in_sentence)
val = W * X
Reduce CPU-GPU data movement: do data transfer as early as possible, since GPU operations are asynchronous and the copy can overlap with computation.

Bad:
for x in words_in_sentence:
    # input data for x
    # do processing

Good:
# input data for whole sentence
for x in words_in_sentence:
    # do processing
Memory is a major issue on the GPU: structure your computation to minimize memory movement between devices.