Neural Networks for Natural Language Processing
Tomas Mikolov (Facebook), Brno University of Technology, 2017
Introduction
Text processing is the core business of internet companies today (Google, Facebook, Yahoo, …).
Machine learning techniques are applied to big datasets to improve many tasks:
Artificial neural networks are applied to many language problems:
Beyond artificial neural networks:
representations
The probability of a word sequence is decomposed into a product of conditional probabilities:

$$P(X)=\prod_i P(x_i \mid x_1 \dots x_{i-1})$$

N-gram models approximate this by conditioning only on the last few words (here, a trigram model):

$$P(X)=\prod_i P(x_i \mid x_{i-2}, x_{i-1})$$
π "π’βππ‘ ππ‘ π π‘πππ’ππππ" = π π’βππ‘ Γ π(ππ‘|π’βππ‘) Γ π π π’βππ‘, ππ‘ Γ π(π‘πππ’ππππ|ππ‘, π)
π π π’βππ‘, ππ‘ = π·(π’βππ‘ ππ‘ π) π·(π’βππ‘ ππ‘)
zero probabilities) A Bit of Progress in Language Modeling (Goodman, 2001)
Neural Networks for NLP, Tomas Mikolov 6
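To make the estimation concrete, here is a minimal count-based trigram model in Python (an illustrative sketch: the toy corpus and the add-one smoothing are my own choices, not from the slides):

```python
from collections import defaultdict

# Toy corpus; real language models are trained on billions of words.
corpus = "this is a sentence this is a test this is another sentence".split()
vocab = set(corpus)

bigram_counts = defaultdict(int)   # C(w1 w2)
trigram_counts = defaultdict(int)  # C(w1 w2 w3)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    bigram_counts[(w1, w2)] += 1
    trigram_counts[(w1, w2, w3)] += 1

def p(w3, w1, w2, smooth=1.0):
    """P(w3 | w1, w2) with add-one smoothing, so unseen trigrams
    get a small non-zero probability instead of zero."""
    return ((trigram_counts[(w1, w2, w3)] + smooth) /
            (bigram_counts[(w1, w2)] + smooth * len(vocab)))

print(p("a", "this", "is"))         # seen trigram: high probability
print(p("sentence", "this", "is"))  # unseen trigram: small, but not zero
```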
One-hot representations:
Example: vocabulary = (Monday, Tuesday, is, a, today)
Monday = [1 0 0 0 0]
Tuesday = [0 1 0 0 0]
is = [0 0 1 0 0]
a = [0 0 0 1 0]
today = [0 0 0 0 1]
Also known as 1-of-N coding (where, in our case, N would be the size of the vocabulary).

Bag-of-words representations:
Example: vocabulary = (Monday, Tuesday, is, a, today)
"Monday Monday" = [2 0 0 0 0]
"today is a Monday" = [1 0 1 1 1]
"today is a Tuesday" = [0 1 1 1 1]
"is a Monday today" = [1 0 1 1 1]
Can be extended to bag-of-N-grams to capture local ordering of words.
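Both encodings in a few lines of numpy, using the vocabulary from the examples above (an illustrative sketch):

```python
import numpy as np

vocab = ["Monday", "Tuesday", "is", "a", "today"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """1-of-N coding: a single 1 at the word's position."""
    v = np.zeros(len(vocab), dtype=int)
    v[index[word]] = 1
    return v

def bag_of_words(sentence):
    """Sum of one-hot vectors: word counts, word order is lost."""
    v = np.zeros(len(vocab), dtype=int)
    for word in sentence.split():
        v[index[word]] += 1
    return v

print(one_hot("Monday"))                  # [1 0 0 0 0]
print(bag_of_words("today is a Monday"))  # [1 0 1 1 1]
print(bag_of_words("is a Monday today"))  # [1 0 1 1 1], same vector
```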
Word classes: grouping similar words improves generalization (similar words share the same class). Example:
Class1 = (yellow, green, blue, red)
Class2 = (Italy, Germany, France, Spain)
It is assumed that similar words appear in similar contexts. In language modeling tasks, we can also use counts of classes, which leads to better generalization (better performance on novel data). Class-based n-gram models of natural language (Brown, 1992)
Main statistical tools for NLP:
regularization
techniques than using plain counting
techniques completely fail at
gain in accuracy counts!
[Figure, built up across several slides: a single artificial neuron. Input signals i1, i2, i3 arrive through input synapses with input weights w1, w2, w3 (together the weight vector W). The neuron applies the non-linear activation function max(0, value) and passes the result to the output (axon): output = max(0, W · I), where I is the input signal.]
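The neuron in the figure, as a numpy sketch (the weight and input values are arbitrary):

```python
import numpy as np

I = np.array([0.5, -1.0, 2.0])  # input signals i1, i2, i3
W = np.array([0.1, 0.4, 0.3])   # input weights w1, w2, w3

# Non-linear activation function max(0, value) applied to W . I
output = max(0.0, W.dot(I))
print(output)  # max(0, 0.05 - 0.4 + 0.6) = 0.25
```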
Artificial neurons are only loosely related to the biological neurons (those communicate by sending spike signals at various frequencies). It is better to think of artificial neural networks as learned projections of data (and not as a model of the brain).
[Figure: a feedforward neural network with an input layer, a hidden layer and an output layer.]
Training by backpropagation: compute the gradient of the error and propagate it back through the network, using the same weights that were used in the forward pass. Simplified graphical representation:
[Figure: input layer, hidden layer, output layer.]
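A minimal sketch of the forward and backward passes for one hidden layer (the squared-error loss, the sizes and the learning rate are illustrative choices, not from the slides); note how the backward pass reuses the forward-pass weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, 0.2, 0.8])      # one training input
t = np.array([1.0, 0.0])           # its target output
W = rng.uniform(0.1, 0.5, (4, 3))  # input -> hidden weights
V = rng.uniform(-0.5, 0.5, (2, 4)) # hidden -> output weights
lr = 0.05                          # learning rate, chosen manually

for step in range(500):
    # Forward pass
    h = np.maximum(0.0, W @ x)     # hidden layer, activation max(0, value)
    y = V @ h                      # linear output layer

    # Backward pass: gradient of the error 0.5 * ||y - t||^2
    err = y - t
    dV = np.outer(err, h)
    dh = V.T @ err                 # error flows back through the SAME weights V
    dh[h <= 0.0] = 0.0             # gradient of max(0, value)
    dW = np.outer(dh, x)

    V -= lr * dV                   # gradient descent updates
    W -= lr * dW

print(y)                           # approaches the target [1, 0]
```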
Choice of the hyper-parameters has to be done manually:
It may seem complicated at first; the best way to start is to re-use an existing setup and try your own modifications.
Deep learning: models with several non-linear layers (hidden layers) in the model. Some functions cannot be represented efficiently with shallow models, for example parity (N bits at input, output is 1 if the number of active input bits is odd) (Perceptrons, Minsky & Papert 1969).
If the final function can be composed of simpler functions, it may be beneficial to use a deep architecture.
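For instance, XOR of two bits can be built from two max(0, value) units, and chaining such XOR blocks computes N-bit parity with O(N) units, while a single hidden layer needs far more. A hand-constructed sketch (my own illustration, not from the slides):

```python
from itertools import product

def xor(a, b):
    """XOR of two bits from two rectified linear units:
    s = a + b; xor = max(0, s) - 2 * max(0, s - 1)."""
    s = a + b
    return max(0, s) - 2 * max(0, s - 1)

def parity(bits):
    """Deep composition: one XOR block per input bit."""
    acc = 0
    for b in bits:
        acc = xor(acc, b)
    return acc

for bits in product([0, 1], repeat=4):
    assert parity(bits) == sum(bits) % 2  # matches the parity definition
```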
[Figure: a deep feedforward network with an input layer, hidden layers 1-3 and an output layer.]
However, a deep model often does not learn more than a shallow (one hidden layer) model can learn: beware the hype!
Word vectors can be obtained from a network trained to predict the next word: each word's vector is a row of the matrix of weights between the input and hidden layers.
[Figure: a neural language model that predicts the next word from the current word through a hidden layer.]
The learned representations are called word vectors (also known as word embeddings): each word is mapped to a point in an N-dimensional space (usually N = 50-1000). After training, similar words have similar vector representations.
Pre-trained word vectors are useful for many NLP tasks (Collobert et al, 2011): they improve generalization for systems trained with a limited amount of supervised data.
Word vectors can be learned with many different architectures, usually using several hidden layers.
The word vector space captures many linguistic regularities (tense, plurality, even semantic concepts like "capital city of"). For example, one can compute the vector "king - man + woman" and obtain a vector close to "queen". Linguistic regularities in continuous space word representations (Mikolov et al, 2013)
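The arithmetic as a sketch, with tiny hand-made vectors standing in for real trained embeddings (real word vectors come out of training, e.g. with word2vec):

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real ones have N = 50-1000 dimensions.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = vec["king"] - vec["man"] + vec["woman"]

# The nearest word (excluding the query words) answers the analogy.
candidates = [w for w in vec if w not in ("king", "man", "woman")]
print(max(candidates, key=lambda w: cosine(vec[w], target)))  # queen
```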
Word-based dataset, almost 20K questions, focuses on both syntax and semantics. Examples (fill in the blank; cue pairs are from the paper's test set):
Athens: Greece → Oslo: ___
Angola: kwanza → Iran: ___
brother: sister → grandson: ___
walking: walked → swimming: ___
Efficient estimation of word representations in vector space (Mikolov et al, 2013)
Phrase-based dataset, focuses on semantics. Example from the paper's test set (fill in the blank):
Steve Ballmer: Microsoft → Larry Page: ___
Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al, 2013)
A Neural Probabilistic Language Model (Bengio et al, 2003)
adding more context without adding the hidden layer!
[Figure: predicting the next word from the current word through a hidden layer.]
The continuous bag-of-words model adds inputs from words within a short window to predict the current word; it is a much simpler architecture than the n-gram NNLM of (Bengio, 2003).
The skip-gram model reverses this: it predicts the surrounding words from the current word.
The size of the vocabulary, and thus of the output layer, can easily be in the order of millions (too many outputs to evaluate). Tricks such as the hierarchical softmax or negative sampling evaluate only a small fraction of the outputs during training.
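A sketch of one skip-gram training step with negative sampling, one of the tricks from the word2vec papers that avoids evaluating the full output layer (the sizes and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
V, dim, k, lr = 10000, 100, 5, 0.025  # vocabulary, vector size, negatives, learning rate
W_in = rng.normal(0, 0.1, (V, dim))   # input word vectors
W_out = np.zeros((V, dim))            # output word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """One SGD step: pull the true (center, context) pair together,
    push k randomly sampled negative pairs apart."""
    v = W_in[center]
    grad_v = np.zeros(dim)
    negatives = rng.integers(0, V, size=k)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = lr * (label - sigmoid(v.dot(W_out[w])))
        grad_v += g * W_out[w]
        W_out[w] += g * v
    W_in[center] += grad_v

# Only 1 + k = 6 output vectors are touched, instead of all 10000.
train_pair(center=42, context=7)
```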
The quality of the resulting word vectors depends more on the amount of training data than on the choice of the technique itself; low computational complexity matters because it allows training on more data. The code has been published as the word2vec project: https://code.google.com/p/word2vec/
Document classification does not need to be deep: a simple and scalable text classifier ("fastText") is often as accurate as deep learning classifiers, and many orders of magnitude faster to train on large datasets.
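A minimal sketch of the idea: represent a document as the average of its word vectors and put a linear softmax classifier on top (illustrative numpy, not the actual fastText code; in fastText the embeddings and the classifier are trained jointly by SGD):

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, classes = 1000, 10, 2
E = rng.normal(0, 0.1, (V, dim))  # word embeddings
W = np.zeros((classes, dim))      # linear classifier on top

def predict(word_ids):
    h = E[word_ids].mean(axis=0)   # document = average of its word vectors
    scores = W @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()             # softmax over the classes

print(predict([3, 17, 256]))  # untrained model: uniform class probabilities
```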
understand language
Recurrent networks were popular in the early 90's (Elman, Jordan, Mozer, Hopfield, the Parallel Distributed Processing group, …). Recurrence allows the network to perform iterative computation (usually over time).
Simple recurrent network (Elman network): an input layer, a hidden layer with recurrent connections, and the output layer. In principle, the recurrent connections allow the model to represent unlimited memory.
(Finding structure in time, Elman 1990)
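The Elman network's forward step as a sketch (the sizes are illustrative, and tanh stands in for the non-linearity): the hidden state is fed back as additional input at the next time step, which is what gives the model its memory.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 8
W_ih = rng.normal(0, 0.1, (hidden, vocab))   # input -> hidden
W_hh = rng.normal(0, 0.1, (hidden, hidden))  # recurrent connections
W_ho = rng.normal(0, 0.1, (vocab, hidden))   # hidden -> output

def step(word_id, h_prev):
    x = np.zeros(vocab)
    x[word_id] = 1.0                       # 1-of-N input coding
    h = np.tanh(W_ih @ x + W_hh @ h_prev)  # state depends on the whole history
    y = W_ho @ h                           # scores for the next word
    return h, y

h = np.zeros(hidden)
for w in [0, 2, 3, 0]:  # process a word sequence, one symbol at a time
    h, y = step(w, h)
```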
Recurrent networks then almost disappeared from mainstream research, as they were considered too unstable to be trained. Hochreiter & Schmidhuber proposed the Long Short-Term Memory (LSTM) RNN architecture, but this model was too complex for others to reproduce easily.
Around 2010, recurrent networks became state of the art in language modeling, machine translation, data compression and speech recognition (including a strong commercial speech recognizer from IBM). Many groups reproduced the results and extended the techniques (used at Microsoft Research, Google, IBM, Facebook, Yandex, …). Simple tricks such as gradient clipping helped with the instability of training.
There are now RNN implementations in general ML toolkits; models scale to large hidden layers (thousands of hidden neurons).
Models are often becoming unnecessarily complex (it is easier to get a paper published and attract attention).
How can we represent a variable-length sequence of symbols efficiently?
We would like to learn concepts that can be described in natural language (or in any Turing-complete computational system), for example addition of numbers learned from examples; this is beyond current deep learning techniques.
There are many possible research directions; which ones should we focus on? Define the end goal first, then achieve it step-by-step.
A Roadmap towards Machine Intelligence (Tomas Mikolov, Armand Joulin and Marco Baroni)
Can do almost anything: for example, help us with problems that would otherwise require hundreds of years of work to solve.
Components the intelligent machine will consist of:
The machine should be able to modify itself to adapt to new problems.
To build and develop intelligent machines, we need:
learning strategies
The machine's communication skills have to be developed gradually; an environment where complexity keeps growing seems essential for success: confronted with full complexity at once, the machine can fail to learn.
Simulated environment:
Scaling up:
Environment: defines the tasks for the learner, represents the world.
Learner: receives the input signal and produces an output signal to maximize the average incoming reward.
Teacher: assigns tasks and rewards, and can later be replaced by human users.
The input and output channels carry sequences of symbols and can be used for communication with the Teacher.
One could use the reinforcement learning techniques of today that will "solve" simple tasks in the simulated world using a myriad of trials, but this will not scale to complex problems. Going through a few tasks should be enough for the machine to generalize to similar tasks later, with a decreasing amount of explicit rewards.
At first, the machine will communicate with experts (us) who will teach it novel behavior; later, it should be usable by human non-experts.
Certain trivial patterns are nowadays hard to learn, for example memorizing sequences such as a^n b^n.
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets (Joulin & Mikolov, 2015)
For the machine to scale to more complex problems, we need:
A Roadmap towards Machine Intelligence: http://arxiv.org/abs/1511.08130
A stack-augmented RNN has structured external memory (stacks) that the neural net learns to control. The results exceed what was shown before (though still very toyish): much less supervision, and scaling to more complex tasks.
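A sketch of the continuous stack from the paper: the network produces a softmax over push/pop/no-op actions, and the new stack is the convex combination of the three outcomes, so everything stays differentiable and trainable by backpropagation (the sizes and names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stack_update(stack, action_scores, push_value):
    """stack[0] is the top. Blend push / pop / no-op by their softmax weights."""
    a_push, a_pop, a_noop = softmax(action_scores)
    pushed = np.roll(stack, 1)
    pushed[0] = push_value   # push: everything moves down, new value on top
    popped = np.roll(stack, -1)
    popped[-1] = 0.0         # pop: everything moves up
    return a_push * pushed + a_pop * popped + a_noop * stack

stack = np.zeros(10)
stack = stack_update(stack, np.array([5.0, 0.0, 0.0]), push_value=1.0)
print(stack[0])  # close to 1.0: the action scores strongly favoured "push"
```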
stacks, lists, queues, tapes, grids, …
Learning simple algorithmic patterns (grammars) from example sequences.
The good:
The bad:
stored there yet)
To achieve true artificial intelligence, we need: