Machine Learning for Computational Linguistics: Recurrent neural networks (RNNs)
Çağrı Çöltekin, University of Tübingen, Seminar für Sprachwissenschaft
July 5, 2016
Feed-forward networks
[Figure: a feed-forward network with inputs x1, x2, hidden units h1, h2 (activation f), outputs y1, y2 (activation g), bias units, and weight matrices W(1) and W(2)]
h = f(W(1) x)
y = g(W(2) h) = g(W(2) f(W(1) x))
- f() and g() are non-linear
functions, such as logistic sigmoid, tanh, or ReLU
- weights are updated using
backpropagation
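As a minimal illustration of the forward computation above, a NumPy sketch with made-up weights and input (the values are not from the slide; the network is 2-2-2 with no explicit bias terms):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up weights for a 2-2-2 network
W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])    # input  -> hidden, W(1)
W2 = np.array([[1.0,  0.4],
               [-0.6, 0.9]])    # hidden -> output, W(2)

x = np.array([1.0, 2.0])        # input vector

h = np.tanh(W1 @ x)             # h = f(W(1) x), here f = tanh
y = sigmoid(W2 @ h)             # y = g(W(2) h), here g = logistic sigmoid
print(h, y)
```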
Dense (word) representations
- Dense vector representations are useful for many ML
methods, particularly for neural networks
- Unlike sparse (one-of-K / one-hot) representations, dense
representations capture similarities and differences between words, as well as relations between them
- General-purpose word vectors can be trained with
unlabeled data
- They can also be trained for the task at hand
- Two methods to obtain (general purpose) dense
representations:
– global statistics over the complete data (e.g., SVD, GloVe)
– predicting the local context of words (e.g., word2vec)
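As a small illustration of the point about similarities (all vectors below are made up, not trained): with one-hot vectors every pair of distinct words looks equally unrelated, while dense vectors can place similar words close together:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# one-of-K (one-hot) vectors: all distinct words are equally dissimilar
cat_1hot, dog_1hot, car_1hot = np.eye(3)
print(cosine(cat_1hot, dog_1hot), cosine(cat_1hot, car_1hot))  # 0.0  0.0

# dense vectors (made-up values): related words get similar vectors
cat = np.array([0.8, 0.1, 0.3])
dog = np.array([0.7, 0.2, 0.4])
car = np.array([-0.5, 0.9, 0.0])
print(cosine(cat, dog))   # close to 1: 'cat' and 'dog' are similar
print(cosine(cat, car))   # negative/low: 'cat' and 'car' are not
```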
Deep feed-forward networks
[Figure: a deep feed-forward network with inputs x1 … xm and several hidden layers]
- Deep neural networks (> 2 hidden
layers) have recently been successful in many tasks
- They are particularly useful in
problems where layers/hierarchies of features are useful
- Training deep networks with
backpropagation may result in vanishing or exploding gradients (see the sketch below)
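A rough sketch of the vanishing-gradient problem (random made-up weights, for illustration only): with sigmoid units, each layer multiplies the backpropagated error by the activation derivative (at most 0.25), so the gradient norm tends to shrink exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 20, 10
grad = np.ones(width)                    # pretend error signal at the output

for layer in range(n_layers):
    W = rng.normal(scale=0.5, size=(width, width))  # made-up layer weights
    a = rng.normal(size=width)                      # made-up pre-activations
    s = sigmoid(a)
    grad = W.T @ (grad * s * (1 - s))    # chain rule through sigma and W
    print(layer, np.linalg.norm(grad))   # the norm typically decays towards 0
```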
Convolutional networks
[Figure: a 1-D convolutional network: inputs x1–x5 are transformed by convolution into h1–h5, then by pooling into h′1–h′3]
- Convolution transforms the input by replacing each input unit
with a weighted sum of its neighbors
- Typically it is followed by pooling
- CNNs are useful for detecting local features with some amount
of location invariance
- Sparse connectivity makes CNNs computationally efficient (see the sketch below)
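A minimal sketch of the convolution and pooling steps on a toy 1-D input (the filter weights are made up):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # inputs x1..x5
w = np.array([0.5, 1.0, 0.5])             # one convolution filter (made up)

# convolution: each hidden unit is a weighted sum of a small neighbourhood
h = np.array([w @ x[i:i + 3] for i in range(len(x) - 2)])
print(h)

# max pooling over windows of 2: keeps the strongest local response,
# giving some invariance to where exactly the feature occurred
pooled = np.array([h[i:i + 2].max() for i in range(len(h) - 1)])
print(pooled)
```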
CNNs for NLP
[Figure: a CNN for sentence classification over the input "not really worth seeing": word vectors → convolution → feature maps → pooling → features → classifier]
Recurrent neural networks: motivation
- Feed-forward networks
– can only learn fixed input–output associations
– have no memory of earlier inputs: they cannot handle sequences
- Recurrent neural networks are the neural-network solution to sequence
learning
- This is achieved by recurrent loops in the network
Recurrent neural networks
[Figure: an RNN with inputs x1–x4, hidden units h1–h4 with recurrent connections, and output y]
- Recurrent neural networks are similar to the standard
feed-forward networks
- But they include loops that use the previous output of the
hidden layers as well as the current input
- Forward calculation is straightforward, but learning becomes
somewhat tricky (see the sketch below)
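A sketch of the forward computation (random made-up weights and inputs): the only difference from a feed-forward layer is that the previous hidden state is fed back in at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, d_out = 4, 3, 2

W_xh = rng.normal(scale=0.1, size=(d_h, d_in))    # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))     # hidden -> hidden (the loop)
W_hy = rng.normal(scale=0.1, size=(d_out, d_h))   # hidden -> output

xs = rng.normal(size=(5, d_in))    # a toy sequence of 5 input vectors
h = np.zeros(d_h)                  # initial hidden state

for x in xs:
    h = np.tanh(W_xh @ x + W_hh @ h)   # new state uses the input and the old state

y = W_hy @ h                           # e.g., one output after the whole sequence
print(y)
```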
A simple version: SRNs
Elman (1990)
[Figure: an Elman network: input units and context units feed the hidden units, which feed the output units; the hidden activations are copied back into the context units at each step]
- The network keeps
previous hidden states (context units)
- The rest is just like a
feed-forward network
- Training is simple,
but SRNs cannot learn long-distance dependencies
Processing sequences with RNNs
- RNNs process sequences one unit at a time
- The earlier input affects the output through the recurrent
links
[Figure: the network processes the input "not really worth seeing" word by word; the hidden states h1–h4 carry the earlier words forward to the output y]
Learning in recurrent networks
[Figure: an RNN with input x, hidden state h(1), and output y(1); three weight matrices connect input to hidden, hidden to hidden (the recurrent link), and hidden to output]
- We need to learn three sets of
weights: the input-to-hidden, the recurrent (hidden-to-hidden), and the hidden-to-output weights
- Backpropagation in RNNs is at
first not obvious: it is not immediately clear how
errors should be propagated back through the recurrent connections
Unrolling a recurrent network
Backpropagation through time (BPTT)
[Figure: the network unrolled in time: inputs x(0) … x(t), hidden states h(0) … h(t), outputs y(0) … y(t)]
Note: the same weight matrices are reused (shared) at every time step.
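A minimal BPTT sketch under simplifying assumptions (made-up dimensions, random weights, a single scalar output and squared error at the last step): the forward pass stores all hidden states of the unrolled network, and the backward pass sums the gradients of the shared weights over all time steps:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, T = 3, 4, 6

W_xh = rng.normal(scale=0.5, size=(d_h, d_in))
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
w_hy = rng.normal(scale=0.5, size=d_h)     # scalar output for simplicity

xs = rng.normal(size=(T, d_in))
target = 1.0

# forward pass, keeping every hidden state of the unrolled network
hs = [np.zeros(d_h)]
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
y = w_hy @ hs[-1]
loss = 0.5 * (y - target) ** 2

# backward pass through time: W_xh and W_hh are shared across time steps,
# so their gradients are accumulated over all steps
dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
dw_hy = (y - target) * hs[-1]
dh = (y - target) * w_hy
for t in reversed(range(T)):
    da = dh * (1 - hs[t + 1] ** 2)    # back through the tanh non-linearity
    dW_xh += np.outer(da, xs[t])
    dW_hh += np.outer(da, hs[t])
    dh = W_hh.T @ da                  # pass the error one step back in time
print(loss, np.linalg.norm(dW_hh))
```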
RNN architectures
Many-to-many (e.g., POS tagging)
[Figure: an output y(i) is produced at every time step: x(0) … x(t) → h(0) … h(t) → y(0) … y(t)]
RNN architectures
Many-to-one (e.g., document classification)
[Figure: only a single output y(t) is produced after the whole sequence x(0) … x(t) has been read]
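A small sketch of the difference between the two architectures above (random made-up weights): the recurrence is identical, only where outputs are read off differs:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h, d_out, T = 3, 4, 2, 5
W_xh = rng.normal(scale=0.3, size=(d_h, d_in))
W_hh = rng.normal(scale=0.3, size=(d_h, d_h))
W_hy = rng.normal(scale=0.3, size=(d_out, d_h))

xs = rng.normal(size=(T, d_in))
h, ys = np.zeros(d_h), []
for x in xs:
    h = np.tanh(W_xh @ x + W_hh @ h)
    ys.append(W_hy @ h)

print(np.array(ys).shape)   # many-to-many (e.g., POS tagging): one output per input
print(ys[-1])               # many-to-one (e.g., document classification): last output only
```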
RNN architectures
Many-to-many with a delay (e.g., machine translation)
[Figure: the outputs y(t−1), y(t) are produced only after the input sequence x(0) … x(t) has been read]
Bidirectional RNNs
[Figure: a bidirectional RNN: a chain of forward hidden states and a chain of backward hidden states over the inputs x(t−1), x(t), x(t+1) are combined to produce the outputs y(t−1), y(t), y(t+1)]
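A minimal sketch of the idea (random made-up weights): one RNN runs left-to-right, another right-to-left, and the output at each position can use both hidden states:

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h, T = 3, 4, 5
xs = rng.normal(size=(T, d_in))   # a toy input sequence

def run_rnn(seq, W_xh, W_hh):
    h, states = np.zeros(W_hh.shape[0]), []
    for x in seq:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return states

Wf_xh = rng.normal(scale=0.3, size=(d_h, d_in))   # forward RNN weights
Wf_hh = rng.normal(scale=0.3, size=(d_h, d_h))
Wb_xh = rng.normal(scale=0.3, size=(d_h, d_in))   # backward RNN weights
Wb_hh = rng.normal(scale=0.3, size=(d_h, d_h))

forward = run_rnn(xs, Wf_xh, Wf_hh)               # left to right
backward = run_rnn(xs[::-1], Wb_xh, Wb_hh)[::-1]  # right to left, re-aligned

# at position t the output can be computed from both states, e.g. by concatenation
combined = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(combined[0].shape)   # (8,) = forward state + backward state
```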
A short digression: language models
- Language models are useful in many NLP tasks
- A language model defines a probability distribution over
sequences of words
- An n-gram model approximates the probability of a sequence of
words as
P(w1, …, wm) ≈ ∏_{i=1}^{m} P(wi | wi−1, …, wi−(n−1))
- Conditional probabilities are estimated from an (unlabeled)
corpus
- Larger n-grams require lots of memory, and their
probabilities cannot be estimated reliably due to sparsity (see the toy sketch below)
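A toy sketch of a bigram (n = 2) model: maximum-likelihood estimates from a tiny made-up corpus, and the product above used to score a short sequence:

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the cat .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(w, prev):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

# P(the cat sat) ~ P(cat | the) * P(sat | cat), ignoring the first word
sentence = "the cat sat".split()
prob = 1.0
for prev, w in zip(sentence, sentence[1:]):
    prob *= p(w, prev)
print(prob)   # 0.5 * 0.5 = 0.25 for this toy corpus
```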
RNNs as language models
- RNNs can function as language models
- We can train RNNs using unlabeled data for this purpose
- During training the task of the RNN is to predict the next word
- Depending on the network configuration, an RNN can
learn dependencies over longer distances
- The resulting system can generate sequences
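A sketch of both points for a character-level model (the weights here are random and untrained, so the "generated" text is gibberish, but the mechanics are the same): the training targets are simply the next characters, and generation feeds the sampled output back in as the next input:

```python
import numpy as np

rng = np.random.default_rng(5)
text = "not really worth seeing"
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V, d_h = len(chars), 16

W_xh = rng.normal(scale=0.1, size=(d_h, V))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(V, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# training pairs: at every position the target is just the next character
pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]
print(pairs[:3])

# generation: sample from the predicted distribution and feed it back in
h, c = np.zeros(d_h), text[0]
generated = c
for _ in range(20):
    x = np.zeros(V)
    x[idx[c]] = 1.0                      # one-hot input for the current character
    h = np.tanh(W_xh @ x + W_hh @ h)
    probs = softmax(W_hy @ h)            # distribution over the next character
    c = chars[rng.choice(V, p=probs)]
    generated += c
print(generated)
```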
RNNs as language models: a fun example
RNNs trained on character sequences generating Shakespeare:

PANDARUS:
Alas, I think he shall be come approached and the day When little srain would be attain'd into being never fed, And who is but a chain and subjects of his death, I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and my fair nues begun out of the fact, to be conveyed, Whose noble souls I'll have the heart of the wars.

Clown:
Come, sir, I will make did behold your worship.
From Andrej Karpathy’s blog at http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Unstable gradients revisited
- We noted earlier that the gradients may vanish or explode
during backpropagation in deep networks
- This is especially problematic for RNNs since the effective
depth of the network can be extremely large
- Although RNNs can theoretically learn long-term
dependencies, in practice long-distance dependencies become a problem (see the sketch below)
- The most popular solution is to use gated recurrent
networks
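A rough sketch of why the gradients are unstable (made-up random recurrent matrices, and the activation derivative is ignored): during BPTT the error is multiplied by the recurrent weight matrix once per time step, so over long distances it either shrinks towards zero or blows up:

```python
import numpy as np

rng = np.random.default_rng(6)
d_h, steps = 8, 50

for scale in (0.1, 0.5):                 # 'small' vs 'large' recurrent weights
    W_hh = rng.normal(scale=scale, size=(d_h, d_h))
    g = np.ones(d_h)                     # pretend error signal
    for _ in range(steps):
        g = W_hh.T @ g                   # one step back through the recurrence
    print(scale, np.linalg.norm(g))      # typically: near 0 for 0.1, huge for 0.5
```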
LSTMs
- Long short-term memory models (LSTMs) are the most popular
‘gated’ RNNs
- They have been shown to perform well in many sequence
learning problems
- In essence they are similar to simple RNNs, but they can learn
longer dependencies
- Although LSTMs date back to the 1990s, they are very popular
at present, and many variants have been developed recently
LSTMs: the picture
Source: Christopher Olah’s blog http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs: the internal ‘cell’ memory
- C is a (long-term) internal
memory
- C is calculated at each time
step using relatively straightforward calculations
- C can hold information for
long stretches of time
LSTMs: step-by-step explanation
ft = σ(Wf [ht−1, xt])
it = σ(Wi [ht−1, xt])
C̃t = tanh(WC [ht−1, xt])
Ct = ft ⊗ Ct−1 + it ⊗ C̃t
ot = σ(Wo [ht−1, xt])
ht = ot ⊗ tanh(Ct)
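A direct NumPy transcription of these equations for a single time step, as a sketch (dimensions and weights are made up, and biases are omitted to match the formulas above); [ht−1, xt] is the concatenation of the previous hidden state and the current input, and ⊗ is element-wise multiplication:

```python
import numpy as np

rng = np.random.default_rng(7)
d_in, d_h = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate, each acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = (rng.normal(scale=0.3, size=(d_h, d_h + d_in)) for _ in range(4))

def lstm_step(x, h_prev, C_prev):
    z = np.concatenate([h_prev, x])   # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z)              # forget gate f_t
    i = sigmoid(W_i @ z)              # input gate i_t
    C_tilde = np.tanh(W_C @ z)        # candidate cell state
    C = f * C_prev + i * C_tilde      # new cell state (element-wise products)
    o = sigmoid(W_o @ z)              # output gate o_t
    h = o * np.tanh(C)                # new hidden state
    return h, C

h, C = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # a toy sequence of 5 inputs
    h, C = lstm_step(x, h, C)
print(h, C)
```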
Summary, concluding remarks
- Recurrent neural networks include recurrent (loop) connections
- They keep a state, allowing the network to learn sequences
- RNNs are Turing-complete: they can approximate any
dynamical system
- Simple recurrent networks suffer from unstable
gradients, and cannot learn long-term dependencies
- LSTMs and other gated RNNs solve the long-term
dependency problem (to some extent)
Credits and References
- LSTM diagrams are from Christopher Olah’s blog at
http://colah.github.io/posts/2015-08-Understanding-LSTMs/, which is also a very good introduction to LSTMs
- Also see Andrej Karpathy’s blog at
http://karpathy.github.io/2015/05/21/rnn-effectiveness/, from which the Shakespeare example comes; it is also a very nicely written introduction to RNNs