SLIDE 1
Natural language processing: foundations and applications
Lecture 10: Neural networks (1)
Tim Van de Cruys & Philippe Muller, 2016–2017
SLIDE 2 Introduction
Machine learning for NLP
- Standard approach: linear model trained over high-dimensional
but very sparse feature vectors
- Recently: non-linear neural networks over dense input vectors
SLIDE 3 Neural Network Architectures
Feed-forward neural networks
- Best known, standard neural network approach
- Fully connected layers
- Can be used as drop-in replacement for typical NLP classifiers
SLIDE 4 Feature representation
Dense vs. one hot
- One hot: each feature is its own dimension
- Dimensionality of the vector is the same as the number of features
- Each feature is completely independent of the others
- Dense: each feature is a d-dimensional vector
- Dimensionality is d
- Similar features have similar vectors
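For illustration (not on the original slides), a minimal numpy sketch of the two representations; the toy vocabulary, dimensions and random values are assumptions:

    import numpy as np

    vocab = ["the", "dog", "barks", "cat"]     # toy vocabulary of 4 features
    V, d = len(vocab), 3                       # one-hot dimension = V, dense dimension = d

    def one_hot(word):
        # one dimension per feature; any two different words are orthogonal
        v = np.zeros(V)
        v[vocab.index(word)] = 1.0
        return v

    # dense: a V x d embedding matrix (the values would normally be learned)
    rng = np.random.default_rng(0)
    E = rng.normal(size=(V, d))

    print(one_hot("dog"))          # [0. 1. 0. 0.]
    print(E[vocab.index("dog")])   # a 3-dimensional dense vector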
SLIDE 5 Feature representation
Feature combinations
- Traditional NLP: specify interactions of features
- E.g. a feature like 'word is jump, tag is V and previous word is they'
- Non-linear network: only specify core features
- Non-linearity of network takes care of finding indicative feature
combinations
SLIDE 6 Feature representation
Why dense?
- Discrete approach often works surprisingly well for NLP tasks
- n-gram language models
- POS-tagging, parsing
- sentiment analysis
- Still, a very poor representation of word meaning
- No notion of similarity
- Limited inference
SLIDE 7 Feature representation
Why dense?
One-hot representation of a word:
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ]
SLIDE 8 Feature representation
Why dense?
Two different words yield two orthogonal one-hot vectors: no notion of similarity
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ]
[ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
SLIDE 12
Feed-forward
Architecture
Multi-layer perceptron with 2 hidden layers:
NN_MLP2(x) = y          (1)
h1 = g(x W1 + b1)       (2)
h2 = g(h1 W2 + b2)      (3)
y = h2 W3               (4)
x: vector of size d_in = 3
y: vector of size d_out = 2
h1, h2: vectors of size d_hidden = 4
SLIDE 13
Feed-forward
Architecture
Multi-layer perceptron with 2 hidden layers:
NN_MLP2(x) = y          (1)
h1 = g(x W1 + b1)       (2)
h2 = g(h1 W2 + b2)      (3)
y = h2 W3               (4)
W1, W2, W3: matrices of size [3 × 4], [4 × 4], [4 × 2]
b1, b2: 'bias' vectors of size d_hidden = 4
g(·): non-linear activation function (elementwise)
SLIDE 14
Feed-forward
Architecture
Multi-layer perceptron with 2 hidden layers:
NN_MLP2(x) = y          (1)
h1 = g(x W1 + b1)       (2)
h2 = g(h1 W2 + b2)      (3)
y = h2 W3               (4)
W1, W2, W3, b1, b2 = parameters of the network (θ)
Use of multiple hidden layers: deep learning
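For illustration (not on the original slides), a minimal numpy sketch of this forward pass with the dimensions above (d_in = 3, d_hidden = 4, d_out = 2), assuming tanh as the activation g:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out = 3, 4, 2

    # parameters theta = (W1, b1, W2, b2, W3), randomly initialized here
    W1, b1 = rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden)
    W2, b2 = rng.normal(size=(d_hidden, d_hidden)), np.zeros(d_hidden)
    W3 = rng.normal(size=(d_hidden, d_out))

    def nn_mlp2(x, g=np.tanh):
        h1 = g(x @ W1 + b1)    # first hidden layer   (2)
        h2 = g(h1 @ W2 + b2)   # second hidden layer  (3)
        return h2 @ W3         # output layer         (4)

    y = nn_mlp2(rng.normal(size=d_in))
    print(y.shape)             # (2,)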
SLIDE 15 Feed-forward
Non-linear activation functions
Sigmoid (logistic) function:
σ(x) = 1 / (1 + e^(−x))
SLIDE 16
Feed-forward
Non-linear activation functions
Hyperbolic tangent (tanh) function:
tanh(x) = (e^(2x) − 1) / (e^(2x) + 1)
SLIDE 17
Feed-forward
Non-linear activation functions
Rectified linear unit (ReLU):
ReLU(x) = max(0, x)
SLIDE 18 Feed-forward
Output transformation function
Softmax function, for x = x1 . . . xk:
softmax(x_i) = e^(x_i) / Σ_{j=1}^{k} e^(x_j)
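The four functions above, written out in numpy as an illustration (not part of the slides); the softmax shifts by the maximum for numerical stability, which does not change the result:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # same as np.tanh(x)

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(x):
        z = np.exp(x - np.max(x))   # shift by max(x) for numerical stability
        return z / z.sum()          # softmax(x_i) = e^(x_i) / sum_j e^(x_j)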
SLIDE 19 Feed-forward
Input vector
- Embedding lookup from embedding matrix
- concatenate or sum embeddings
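For illustration (not part of the slides), a numpy sketch of the lookup followed by concatenation or summation; the embedding matrix, its dimensions and the window indices are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 10000, 50
    E = rng.normal(size=(V, d))      # embedding matrix

    window = [12, 507, 3]            # indices of the 3 words in the input window
    vecs = E[window]                 # embedding lookup: 3 x d

    x_concat = vecs.reshape(-1)      # concatenation: one vector of size 3 * d
    x_sum = vecs.sum(axis=0)         # sum: one vector of size d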
SLIDE 20 Feed-forward
Loss functions
- L(ŷ, y): the loss of predicting ŷ when the true output is y
- Set parameters θ in order to minimize loss across different
training examples
- Compute the gradient of the loss function with respect to the parameters;
to find the minimum, take steps in the right direction
SLIDE 21 Feed-forward
Loss functions
- Hinge loss (binary and multi-class)
- classify correct class over incorrect class(es) with margin of at
least 1
- Categorical cross-entropy loss (negative log-likelihood)
- Measure difference between true class distribution y and
predicted class distribution ŷ
- Use with softmax output
- Ranking loss
- In unsupervised setting: rank attested examples over
unattested, corrupted ones with margin of at least 1
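For instance, the categorical cross-entropy loss can be sketched as follows (illustrative; the example distributions are made up):

    import numpy as np

    def cross_entropy(y_true, y_hat, eps=1e-12):
        # negative log-likelihood of the true class under the predicted distribution
        return -np.sum(y_true * np.log(y_hat + eps))

    y_true = np.array([0.0, 1.0, 0.0])    # true class distribution (one-hot)
    y_hat = np.array([0.1, 0.7, 0.2])     # predicted (softmax) distribution
    print(cross_entropy(y_true, y_hat))   # -log(0.7), roughly 0.357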
SLIDE 22 Training
Stochastic gradient descent
- Goal: minimize the total loss Σ_{i=1}^{n} L(f(x_i; θ), y_i)
- Estimating gradient over entire training set before taking step is
computationally heavy
- Compute gradient for small batch of samples from training set
→ estimate of gradient: stochastic
- Learning rate λ: size of step in right direction
- Improvements: momentum, adaptive learning rate
SLIDE 23 Training
Stochastic gradient descent
- Size of the mini-batch: balance between a better gradient estimate and faster
convergence
- Gradients over different parameters (weight matrices, bias
terms, embeddings, ...) efficiently calculated using backpropagation algorithm
- No need to carry out derivations yourself: automatic tools for
gradient computation using computational graphs
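For illustration (not on the original slides), one SGD step for a simple linear model with squared loss, with the gradient written out by hand; for a full network the gradients would come from backpropagation or an automatic differentiation tool:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 2))     # parameters theta of a linear model f(x) = xW
    lr = 0.1                        # learning rate lambda

    def sgd_step(X_batch, Y_batch):
        # one update on a mini-batch, using the gradient of 0.5 * mean squared error
        global W
        Y_hat = X_batch @ W
        grad = X_batch.T @ (Y_hat - Y_batch) / len(X_batch)   # dL/dW, estimated on the batch
        W -= lr * grad                                         # step against the gradient

    X, Y = rng.normal(size=(8, 3)), rng.normal(size=(8, 2))   # one mini-batch of 8 samples
    sgd_step(X, Y)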
SLIDE 24 Training
Initialization
- Parameters of network are initialized randomly
- Magnitude of random samples has effect on training success
- effective initialization schemes exist
SLIDE 25 Training
Misc
- Shuffling: shuffle training set with each epoch
- Learning rate: balance between proper convergence and fast
convergence
- Minibatch: balance speed/proper estimate; efficient using GPU
SLIDE 26 Training
Regularization
- Neural networks have many parameters: risk of overfitting
- Solution: regularization
- L2: extend the loss function with a squared penalty on the parameters,
i.e. (λ/2) ||θ||²
- Dropout: Randomly dropping (setting to zero) half of the
neurons in the network for each training sample
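Both ideas in a few lines of numpy, as an illustrative sketch (the 'inverted' rescaling in the dropout function is one common convention, not something stated on the slide):

    import numpy as np

    rng = np.random.default_rng(0)

    def l2_penalty(params, lam):
        # term added to the loss: (lam / 2) * ||theta||^2
        return 0.5 * lam * sum(np.sum(p ** 2) for p in params)

    def dropout(h, p=0.5):
        # randomly set a fraction p of the neurons to zero (training time only);
        # the 1/(1-p) rescaling keeps the expected activation unchanged
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    h = rng.normal(size=10)
    print(dropout(h))               # roughly half of the entries are zero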
SLIDE 27 Word embeddings
- Each word i is represented by a small, dense vector vi ∈ Rd
- d is typically in the range 50–1000
- Matrix of size V (vocabulary size) × d (embedding size)
- words are ‘embedded’ in a real-valued, low-dimensional space
- Similar words have similar embeddings
SLIDE 28 Word embeddings
Example of an embedding matrix (one row per word, d columns):

          d1      d2      d3    ...
apple  –2.34   –1.01    0.33
pear   –2.28   –1.20    0.11
car    –0.20    1.02    2.44
. . .
SLIDE 30 Neural word embeddings
- Word embeddings have been around for quite some time
- The term ‘embedding’ was coined within the neural network
community, along with new methods to learn them
- Idea: Let’s allocate a number of parameters for each word and
allow the neural network to automatically learn what the useful values should be
- Prediction-based: learn to predict the next word
SLIDE 31 Embeddings through language modeling
- Predict the next word in a sequence, based on the previous word
- One non-linear hidden layer, one softmax layer for classification
- Choose parameters that optimize the probability of the correct word
SLIDE 33 Embeddings through error detection
- Take a correct sentence and
create a corrupted counterpart
- Train the network to assign a
higher score to the correct version of each sentence
SLIDE 35 Word2vec
- Neural network approaches work well, but the large number of parameters
makes them computationally heavy
- Popular, light-weight approach with fewer parameters: word2vec
- No hidden layer, only a softmax classifier
- Two different models
- Continuous bag of words (CBOW): predict the current word based on the
surrounding context words
- Skip-gram: predict the surrounding context words based on the current word
SLIDE 36 CBOW
- Current word wt is predicted
from context words
- Prediction is made from the
sum of context embeddings
SLIDE 37 Skip-gram
- Surrounding context words are predicted from the current word
- The parameters of the softmax classifier are shared across the different
context positions
SLIDE 38 Negative sampling
- The softmax classifier (normalized over the whole vocabulary) is still rather expensive
- Negative sampling: for each word, consider its correct context and a number
of wrong, randomly sampled contexts
- Maximize the score of correct contexts and minimize the score of wrong ones
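A rough numpy sketch of one skip-gram update with negative sampling (illustrative only; dimensions, learning rate and the uniform sampling of negatives are assumptions, not the original word2vec implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, lr = 1000, 50, 0.05
    W_in = rng.normal(scale=0.1, size=(V, d))    # word ('input') embeddings
    W_out = rng.normal(scale=0.1, size=(V, d))   # context ('output') embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_update(word, context, k=5):
        # push the score of the correct (word, context) pair up,
        # and the scores of k randomly sampled 'wrong' contexts down
        negatives = rng.integers(0, V, size=k)
        v = W_in[word].copy()
        grad_v = np.zeros(d)
        for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = W_out[c]
            g = sigmoid(v @ u) - label     # gradient of the logistic loss w.r.t. the score
            grad_v += g * u
            W_out[c] -= lr * g * v
        W_in[word] -= lr * grad_v

    sgns_update(word=12, context=42)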
SLIDE 39
Word similarity
Nearest neighbours of four French words:

france            jésus           rouge     peinture
grande-bretagne   christ          jaune     sculpture
belgique          jésus-christ    bleu      gravure
espagne           moïse           vert      dessin
angleterre        dieu            blanc     photographie
pologne           prophète        noir      peintures
bretagne          résurrection    bleue     toile
suède             mahomet         gris      céramique
italie            abraham         verte     poésie
roumanie          saints          blanche   peintre
tunisie           anges           noire     art
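Such neighbour lists can be read off the embedding matrix; a minimal sketch, assuming a numpy matrix E of shape V × d and a list vocab of the corresponding words:

    import numpy as np

    def nearest_neighbours(query, E, vocab, n=10):
        # rank all words by cosine similarity with the query word's embedding
        q = E[vocab.index(query)]
        sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
        ranked = [vocab[i] for i in np.argsort(-sims) if vocab[i] != query]
        return ranked[:n]

    # nearest_neighbours("france", E, vocab)  ->  with suitable embeddings,
    # a list like the first column of the table above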
SLIDE 45 Computing analogies
a is to b as c is to d
- The system is given a, b, and c; it needs to compute d. For example:
  apple is to apples as car is to ?
  man is to woman as king is to ?
SLIDE 46 Computing analogies
- task: a is to b as c is to d
- idea: the direction of the relation should remain the same
  a − b ≈ c − d
  man − woman ≈ king − queen
  queen ≈ king − man + woman
  d_w = arg max_{d′_w ∈ V} cos(d′_w, c − a + b)
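A sketch of this computation (added for illustration), assuming the same embedding matrix E and word list vocab as in the earlier snippet:

    import numpy as np

    def analogy(a, b, c, E, vocab, n=1):
        # find d such that 'a is to b as c is to d', via d ≈ c − a + b
        vec = lambda w: E[vocab.index(w)]
        target = vec(c) - vec(a) + vec(b)
        sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target))
        ranked = [vocab[i] for i in np.argsort(-sims) if vocab[i] not in (a, b, c)]
        return ranked[:n]

    # analogy("man", "woman", "king", E, vocab)  ->  ideally ["queen"]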
SLIDE 52
Computing analogies
Relationship       Example 1            Example 2           Example 3
France - Paris     Italy: Rome          Japan: Tokyo        Florida: Tallahassee
big - bigger       small: larger        cold: colder        quick: quicker
Sarkozy - France   Berlusconi: Italy    Merkel: Germany     Koizumi: Japan
copper - Cu        zinc: Zn             gold: Au            uranium: plutonium
Japan - sushi      Germany: bratwurst   France: tapas       USA: pizza
SLIDE 53
Computing analogies
a is to b as c is to d
homme roi : femme reine
autriche vienne : allemagne berlin
écrivain livre : poète poème
france nicolas sarkozy : états-unis bush
droite ump : gauche ps
droite front national : gauche pcf
SLIDE 54 Prediction-based vs. count-based
- Count-based: extract co-occurrence frequencies from corpus
- Prediction-based: induce parameters based on prediction of
correct word
- Initially seen as two different methods, where prediction-based
approach works better
- Later, emerging evidence and proof that they are actually
equivalent or quite similar
SLIDE 55 Word Embeddings
Unsupervised pre-training: a number of choices (parameters)
- Training objective
- Choice of contexts (bag of words, syntax, parallel corpus)
- Window size (small, sentence, paragraph), dynamic, ...
- Directed
- Lemmatization, normalization, stop words, ...
- Character-based, sub-word models
SLIDE 56 Word embeddings in practice
- Word embeddings are often used for pretraining
- Unsupervised: only requires plain text, so can be trained on a
lot of data
- Fast algorithms available
- It helps a model start from an informed position
- Model is initialized with pretrained word embeddings, and then
finetuned depending on task
SLIDE 58 Beyond words
- Embeddings are not restricted to words
- Can equally be computed for sentences, paragraphs,
documents
- Different methods
- word2vec-alike method (paragraph vector)
- Convolutional neural network
- Recurrent neural network (sequence-to-sequence learning)
- Current research trend in NLP with applications in machine
translation, paraphrase detection, etc.
SLIDE 59 Neural Network Architectures
Convolutional neural networks
- Type of feedforward neural network
- Certain layers are not fully connected (convolutional layers,
pooling layers)
- local cues can appear in different places in the input (cf. vision)
SLIDE 60 Neural Network Architectures
Recurrent (+ recursive) neural networks
- Handle structured data of arbitrary sizes
- Recurrent networks for sequences
- Recursive networks for trees
SLIDE 61 Convolutional layers
Introduction
How to represent variable number of features, e.g. words in a sentence, document?
- Continuous Bag of Words (CBOW): sum embedding vectors of
corresponding features
- no ordering info ("not good quite bad" = "not bad quite good")
- Convolutional layer
- ’Sliding window’ approach that takes local structure into account
- Combine individual windows to create vector of fixed size
SLIDE 62 Feature representation
Variable number of features
- Feed-forward networks assume a fixed-dimensional input
- How to represent variable number of features, e.g. words in a
sentence, document?
- Continuous Bag of Words (CBOW): sum embedding vectors of
corresponding features
SLIDE 63 Convolutional layers
Basic convolution + pooling
- Goal: identify indicative local features (n-grams) in large
structure, combine them into fixed size vector
- Convolution: apply filter to each window (linear transformation
+ non-linear activation)
- Pooling: combine by taking maximum
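For illustration (not on the original slides), a single convolution filter bank with max pooling over a toy sentence; window size, dimensions and the tanh activation are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_words, d, win, d_conv = 7, 50, 3, 100   # sentence length, embedding dim, window size, nb. of filters

    sent = rng.normal(size=(n_words, d))      # embeddings of the words in the sentence
    W = rng.normal(size=(win * d, d_conv))    # filters: one linear transformation per window
    b = np.zeros(d_conv)

    # convolution: apply the same transformation + non-linearity to every window of `win` words
    windows = np.stack([sent[i:i + win].reshape(-1) for i in range(n_words - win + 1)])
    conv = np.tanh(windows @ W + b)           # (n_words - win + 1) x d_conv

    # pooling: take the maximum over the windows, per dimension -> fixed-size vector
    pooled = conv.max(axis=0)                 # vector of size d_conv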
SLIDE 64 Convolutional layers
Narrow vs. wide convolution
- Whether or not to include padding elements to beginning and
end
- E.g. ”the quick brown fox”, convolution of size 2
- Narrow: [the quick], [quick brown], [brown fox]
- Wide: [PAD the], [the quick], [quick brown], [brown fox], [fox PAD]
SLIDE 65 Convolutional layers
Variations
- Dynamic: divide sentence/document in different regions, do
pooling over each region
- Hierarchical: succession of convolution and pooling layers
- K-max pooling: Keep top-k values when pooling
SLIDE 66 Recurrent neural network
Introduction
- CBOW: no ordering, no structure
- CNN: improvement, but mostly local patterns
- RNN: represent arbitrarily sized structured input as fixed-size
vectors, paying attention to structured properties
SLIDE 67 Recurrent neural network
Model
- x1: input layer (current word)
- a1: hidden layer of current timestep
- a0: hidden layer of previous timestep
- U, W and V: weights matrices
- f(·): element-wise activation function (sigmoid)
- g(·): softmax function to ensure probability distribution
a1 = f(U x1 + W a0)     (5)
y1 = g(V a1)            (6)
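A minimal numpy sketch of these two equations, unrolled over an input sequence (the dimensions and random initialization are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 50, 100, 30

    U = rng.normal(scale=0.1, size=(d_hid, d_in))
    W = rng.normal(scale=0.1, size=(d_hid, d_hid))
    V = rng.normal(scale=0.1, size=(d_out, d_hid))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        z = np.exp(x - np.max(x))
        return z / z.sum()

    def rnn(xs):
        # a_t = f(U x_t + W a_{t-1});  y_t = g(V a_t)
        a = np.zeros(d_hid)                 # a_0: initial hidden state
        ys = []
        for x in xs:
            a = sigmoid(U @ x + W @ a)      # hidden state of the current timestep
            ys.append(softmax(V @ a))       # output: probability distribution
        return ys

    outputs = rnn(rng.normal(size=(5, d_in)))   # a sequence of 5 input vectors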
SLIDE 68
Recurrent neural network
Graphical representation
SLIDE 69 Recurrent neural network
Training
- Consider recurrent neural network as very deep neural network
with shared parameters across computation
- Backpropagation through time
- What kind of supervision?
- Acceptor: based on final state
- Transducer: an output for each input (e.g. language modeling)
- Encoder-decoder: one RNN to encode sequence into vector
representation, another RNN to decode into sequence (e.g. machine translation)
SLIDE 70
Recurrent neural network
Training: graphical representation
SLIDE 71 Recurrent neural network
Multi-layer RNN
- multiple layers of RNNs
- input of next layer is output of RNN layer below it
- Empirically shown to work better
SLIDE 72 Recurrent neural network
Bi-directional RNN
- Input sequence both forward and backward to different RNNs
- Representation is concatenation of forward and backward state
(A & A’)
- Represent both history and future
SLIDE 73
Concrete RNN architectures
Simple RNN
SLIDE 74 Concrete RNN architectures
LSTM
- Long short-term memory networks
- In practice, simple RNNs only able to remember narrow context
(vanishing gradient)
- LSTM: complex architecture able to capture long-term
dependencies
SLIDE 81 Concrete RNN architectures
GRU
- LSTM: effective, but complex, computationally expensive
- GRU: cheaper alternative that works well in practice
SLIDE 82 Concrete RNN architectures
GRU
- reset gate (r): how much information from previous hidden
state needs to be included (reset with current information?)
- update gate (z): controls updates to the hidden state (how much
does the hidden state need to be updated with current information?)
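The two gates written out as one GRU step in numpy; this is an illustrative sketch following the standard GRU equations, which the slide only describes informally:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid = 50, 100
    Wr, Ur = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
    Wz, Uz = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
    Wh, Uh = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x, h):
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate: how much of the previous state to use
        z = sigmoid(Wz @ x + Uz @ h)              # update gate: how much of the state to replace
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate new hidden state
        return (1 - z) * h + z * h_cand           # interpolation between old state and candidate

    h1 = gru_step(rng.normal(size=d_in), np.zeros(d_hid))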
SLIDE 83 Recursive neural networks
Introduction
- Generalization of RNNs from sequences to (binary) trees
- Linear transformation + non-linear activation function applied
recursively throughout a tree
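For illustration (not on the original slides), the recursive composition over a binary tree represented as nested tuples of word vectors; the representation and dimensions are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    W = rng.normal(scale=0.1, size=(d, 2 * d))   # composition matrix, shared across the whole tree
    b = np.zeros(d)

    def compose(node):
        # leaf: a d-dimensional word vector; internal node: a pair (left, right);
        # the same linear transformation + tanh is applied recursively at every node
        if isinstance(node, tuple):
            left, right = compose(node[0]), compose(node[1])
            return np.tanh(W @ np.concatenate([left, right]) + b)
        return node

    the, dog, barks = (rng.normal(size=d) for _ in range(3))
    sentence_vec = compose(((the, dog), barks))  # one vector for the whole (binary) tree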
SLIDE 84 Software
- TensorFlow
- Python, C++
- http://www.tensorflow.org
- Theano
- Python
- http://deeplearning.net/software/theano/
- Keras
- Theano/TensorFlow-based modular deep learning library
- Lasagne
- Theano-based deep learning library