SLIDE 1
Deep Dive on RNNs
Charles Martin
SLIDE 2
What is an Artificial Neurone?
Source: Wikimedia Commons
SLIDE 3
Feed-Forward Network
For each unit: y = tanh(Wx + b)
SLIDE 4
Recurrent Network
For each unit: yt = tanh(Uxt + Vht−1 + b)
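A minimal NumPy sketch of both update rules; the sizes and random weight values are illustrative assumptions, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden = 4, 8                    # illustrative sizes

    # Feed-forward unit: y = tanh(Wx + b)
    W = rng.normal(size=(n_hidden, n_in))
    b = np.zeros(n_hidden)
    x = rng.normal(size=n_in)
    y = np.tanh(W @ x + b)

    # Recurrent unit: yt = tanh(Uxt + Vht−1 + b)
    U = rng.normal(size=(n_hidden, n_in))
    V = rng.normal(size=(n_hidden, n_hidden))
    h_prev = np.zeros(n_hidden)              # previous hidden state h_{t-1}
    x_t = rng.normal(size=n_in)
    y_t = np.tanh(U @ x_t + V @ h_prev + b)  # this output is also the new hidden state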
SLIDE 5
Sequence Learning Tasks
SLIDE 6
Recurrent Network
simplifying...
SLIDE 7
Recurrent Network
simplifying and rotating...
SLIDE 8
“State” in Recurrent Networks
◮ Recurrent Networks are all about storing a “state” in between computations.
◮ A “lossy summary of... past sequences”
◮ h is the “hidden state” of our RNN.
◮ What influences h?
SLIDE 9
Defining the RNN State
We can define a simplified RNN represented by this diagram as follows:
ht = tanh(Uxt + Vht−1 + b)
ŷt = softmax(c + Wht)
SLIDE 10
Unfolding an RNN in Time
Figure 1: Unfolding an RNN in Time
◮ By unfolding the RNN we can compute ŷ for a given length of sequence.
◮ Note that the weight matrices U, V, W are the same for each timestep; this is the big advantage of RNNs!
SLIDE 11
Forward Propagation
We can now use the following equations to compute ŷ3, by computing h for the previous steps:
ht = tanh(Uxt + Vht−1 + b)
ŷt = softmax(c + Wht)
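A sketch of this unrolled forward pass over three timesteps; the layer sizes and random weights are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hidden, n_out = 4, 8, 5

    U = rng.normal(size=(n_hidden, n_in))
    V = rng.normal(size=(n_hidden, n_hidden))
    W = rng.normal(size=(n_out, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_out)

    def softmax(z):
        e = np.exp(z - z.max())              # subtract max for numerical stability
        return e / e.sum()

    xs = [rng.normal(size=n_in) for _ in range(3)]   # x1, x2, x3
    h = np.zeros(n_hidden)                           # h0
    for x_t in xs:
        h = np.tanh(U @ x_t + V @ h + b)             # ht = tanh(Uxt + Vht−1 + b)
    y_hat_3 = softmax(c + W @ h)                     # ŷ3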
SLIDE 12
Y-hat is Softmax’d
ŷ is a probability distribution! A finite number of values that sum to 1:
σ(z)j = e^(zj) / Σ_{k=1}^{K} e^(zk), for j = 1, ..., K
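A quick numerical check of this property (the scores z are arbitrary example values):

    import numpy as np

    z = np.array([2.0, 1.0, 0.1])         # arbitrary scores over K = 3 classes
    p = np.exp(z) / np.exp(z).sum()       # σ(z)j = e^(zj) / Σk e^(zk)
    print(p, p.sum())                     # a probability distribution that sums to 1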
SLIDE 13
Calculating Loss: Categorical Cross Entropy
We use the categorical cross-entropy function for loss:
ht = tanh(Uxt + Vht−1 + b)
ŷt = softmax(c + Wht)
Lt = −yt · log(ŷt)
Loss = Σt Lt
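A sketch of computing this loss for a few timesteps, assuming one-hot targets yt and predicted distributions ŷt; the toy values below are assumptions:

    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-12):
        # Lt = −yt · log(ŷt), with y_true one-hot
        return -np.sum(y_true * np.log(y_pred + eps))

    # toy example: 3 timesteps, 5 output classes
    y_true = [np.eye(5)[i] for i in (2, 0, 4)]        # one-hot targets y1..y3
    y_pred = [np.full(5, 0.2)] * 3                    # uniform predicted distributions
    loss = sum(cross_entropy(t, p) for t, p in zip(y_true, y_pred))   # Loss = Σt Lt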
SLIDE 14
Backpropagation Through Time (BPTT)
Propagates error correction backwards through the network graph, adjusting all parameters (U, V, W) to minimise loss.
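In practice a framework such as Keras performs BPTT automatically when a recurrent model is fit; a minimal sketch, where the layer sizes, data shapes and optimizer are illustrative assumptions:

    import numpy as np
    import tensorflow as tf

    # toy data: 100 sequences of length 30 with 4 features, 5 output classes
    X = np.random.rand(100, 30, 4).astype("float32")
    y = tf.keras.utils.to_categorical(np.random.randint(5, size=100), num_classes=5)

    model = tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(8, input_shape=(30, 4)),  # ht = tanh(Uxt + Vht−1 + b)
        tf.keras.layers.Dense(5, activation="softmax"),     # ŷ = softmax(c + Wh)
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    model.fit(X, y, epochs=1)   # gradients are propagated back through the 30 timesteps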
SLIDE 15
Example: Character-level text model
◮ Training data: a collection of text.
◮ Input (X): snippets of 30 characters from the collection.
◮ Target output (y): 1 character, the next one after the 30 in each X (see the sketch below).
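A sketch of building such a dataset from a text file; the filename and variable names are assumptions:

    text = open("corpus.txt").read()              # any text collection
    seq_len = 30

    X_snippets, y_chars = [], []
    for i in range(len(text) - seq_len):
        X_snippets.append(text[i:i + seq_len])    # 30-character input snippet
        y_chars.append(text[i + seq_len])         # the single next character

    # e.g. X_snippets[0] holds the first 30 characters and y_chars[0] is character 31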
SLIDE 16
Training the Character-level Model
◮ Target: A probability distribution with P(n) = 1
◮ Output: A probability distribution over all next letters.
◮ E.g.: “My cat is named Simon” would lead to X: “My cat is named Simo” and y: “n”
SLIDE 17
Using the trained model to generate text
◮ S: Sampling function, sample a letter using the output probability distribution.
◮ The generated letter is reinserted as the next input.
◮ We don’t want to always draw the most likely character; this would give frequent repetition and “copying” from the training text. We need a sampling strategy (one option is sketched below).
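One common strategy is temperature sampling: re-weight the predicted distribution before drawing, so the model neither always picks the most likely character nor samples too wildly. A minimal sketch; the temperature value is a tunable assumption:

    import numpy as np

    def sample_index(probs, temperature=1.0):
        # lower temperature -> more conservative, higher -> more surprising
        logp = np.log(np.asarray(probs) + 1e-12) / temperature
        p = np.exp(logp)
        p /= p.sum()
        return np.random.choice(len(p), p=p)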
SLIDE 18
Char-RNN
◮ RNN as a sequence generator.
◮ Input is the current symbol, output is the next predicted symbol.
◮ Connect output to input and continue (see the generation loop below)!
◮ CharRNN simply applies this to a subset of ASCII characters.
◮ Train and generate on any text corpus: Fun!
See: Karpathy, A. (2015). The Unreasonable Effectiveness of Recurrent Neural Networks.
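A sketch of the generate-and-feed-back loop. It assumes a trained model, an index_to_char mapping, a hypothetical encode() helper that turns the last 30 characters into the model’s input shape, and the sample_index() function from the sampling slide:

    seed = "My cat is named Simon and "
    generated = seed
    for _ in range(200):                         # generate 200 new characters
        x = encode(generated[-30:])              # hypothetical helper: one-hot encode the window
        probs = model.predict(x)[0]              # distribution over the next character
        next_char = index_to_char[sample_index(probs, temperature=0.8)]
        generated += next_char                   # reinsert the generated letter as the next input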
SLIDE 19
Char-RNN Examples
Shakespeare (Karpathy, 2015):
Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that.
LaTeX Algebraic Geometry: N.B. “Proof. Omitted.” Lol.
SLIDE 20
RNN Architectures and LSTM
SLIDE 21
Bidirectional RNNs
◮ Useful for tasks where the whole sequence is available.
◮ Each output unit (ŷ) depends on both past and future, but is most sensitive to closer times.
◮ Popular in speech recognition, translation, etc.
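In Keras, a bidirectional layer is a wrapper around any recurrent layer; a minimal sketch with assumed sizes:

    import tensorflow as tf

    model = tf.keras.Sequential([
        # reads the sequence forwards and backwards and concatenates both hidden states
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64), input_shape=(30, 4)),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])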
SLIDE 22
Encoder-Decoder (seq-to-seq)
◮ Learns to generate an output sequence (y) from an input sequence (x).
◮ The final hidden state of the encoder is used to compute a context variable C.
◮ For example, translation (see the sketch below).
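A sketch of the encoder-decoder idea with Keras LSTMs: the encoder’s final hidden state becomes the context that initialises the decoder. Vocabulary sizes and dimensions are assumptions:

    import tensorflow as tf
    from tensorflow.keras import layers

    enc_in = tf.keras.Input(shape=(None, 50))                   # input sequence (x)
    _, state_h, state_c = layers.LSTM(64, return_state=True)(enc_in)
    context = [state_h, state_c]                                # context C from the final encoder state

    dec_in = tf.keras.Input(shape=(None, 60))                   # output sequence (y), shifted by one step
    dec_seq = layers.LSTM(64, return_sequences=True)(dec_in, initial_state=context)
    y_hat = layers.Dense(60, activation="softmax")(dec_seq)

    model = tf.keras.Model([enc_in, dec_in], y_hat)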
SLIDE 23
Deep RNNs
◮ Does adding deeper layers to an RNN make it work better?
◮ Several options for architecture.
◮ Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013).
◮ Intuitively: layers might learn some hierarchical knowledge automatically.
◮ Typical setup: up to three recurrent layers (see the sketch below).
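Stacking is a small change in Keras: every recurrent layer except the last returns its full sequence of hidden states so the layer above has a sequence to consume. A sketch with the typical three recurrent layers (sizes assumed):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(30, 4)),
        tf.keras.layers.LSTM(64, return_sequences=True),   # middle layer also emits a sequence
        tf.keras.layers.LSTM(64),                          # top layer returns only its final state
        tf.keras.layers.Dense(5, activation="softmax"),
    ])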
SLIDE 24
Long-Term Dependencies
◮ Learning long dependencies is a mathematical challenge.
◮ Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely).
◮ E.g., consider a simplified RNN with no nonlinear activation function or input.
◮ Each time step multiplies h(0) by W.
◮ This corresponds to raising the eigenvalues in Λ to the power t.
◮ Eventually, components of h(0) not aligned with the largest eigenvector will be discarded.
ht = Wht−1
ht = (W^t)h0
Supposing W admits an eigendecomposition with orthogonal matrix Q:
W = QΛQ⊤
ht = QΛ^tQ⊤h0
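A small NumPy demonstration of this effect, using a diagonal W so the eigenvalues are explicit; the values 0.9 and 1.1 are chosen purely for illustration:

    import numpy as np

    W = np.diag([0.9, 1.1])       # one eigenvalue below 1, one above
    h = np.array([1.0, 1.0])      # h0

    for t in range(100):
        h = W @ h                 # ht = W ht−1, i.e. ht = (W^t) h0
    print(h)                      # ≈ [2.7e-05, 1.4e+04]: one component vanishes, the other explodes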
SLIDE 25
Vanishing and Exploding Gradients
◮ “in order to store memories in a way
that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish”
◮ “whenever the model is able to
represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.”
◮ Note that this problem is only
relevant for recurrent networks since the weights W affecting the hidden state are the same at each time step.
◮ Goodfellow and Bengio (2016): “the problem of learning long-term dependencies remains one of the main challenges in deep learning”
◮ WildML (2015). Backpropagation
Through Time and Vanishing Gradients
◮ ML for artists
SLIDE 26
Gated RNNs
◮ Possible solution!
◮ Provide a gate that can change the hidden state a little bit at each step.
◮ The gates are controlled by
learnable weights as well!
◮ Hidden state weights that may
change at each time step.
◮ Create paths through time with
derivatives that do not vanish/explode.
◮ Gates choose information to
accumulate or forget at each time step.
◮ Most effective sequence models
used in practice!
SLIDE 27
Long Short-Term Memory
◮ Self-loop containing an internal state (c).
◮ Three extra gating units:
◮ Forget gate: controls how much memory is preserved.
◮ Input gate: controls how much of the current input is stored.
◮ Output gate: controls how much of the state is shown to the output.
◮ Each gate has its own weights and biases, so this uses lots more parameters.
◮ Some variants on this design, e.g., use c as an additional input to the three gate units.
SLIDE 28
Long Short-Term Memory
◮ Forget gate: f
◮ Internal state: s
◮ Input gate: g
◮ Output gate: q
◮ Output: h
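A NumPy sketch of a single LSTM step using this naming (gates f, g, q, internal state s, output h); the weight shapes are illustrative assumptions, and variants that feed the internal state into the gates are omitted:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, s_prev, params):
        # each gate has its own weights and biases, hence the extra parameters
        Uf, Vf, bf, Ug, Vg, bg, Uq, Vq, bq, Us, Vs, bs = params
        f = sigmoid(Uf @ x_t + Vf @ h_prev + bf)        # forget gate: how much memory is preserved
        g = sigmoid(Ug @ x_t + Vg @ h_prev + bg)        # input gate: how much of the input is stored
        q = sigmoid(Uq @ x_t + Vq @ h_prev + bq)        # output gate: how much state is shown
        s_cand = np.tanh(Us @ x_t + Vs @ h_prev + bs)   # candidate update to the internal state
        s = f * s_prev + g * s_cand                     # internal state (the self-loop)
        h = q * np.tanh(s)                              # output
        return h, s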
SLIDE 29
Other Gating Units
Source: (Olah, C. 2015.)
◮ Are three gates necessary?
◮ Other gating units are simpler, e.g., the Gated Recurrent Unit (GRU); see the sketch below.
◮ For the moment, LSTMs are winning
in practical use.
◮ Maybe someone wants to explore
alternatives in a project?
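In Keras, trying a GRU instead of an LSTM is a drop-in change, which makes this kind of comparison easy to run:

    import tensorflow as tf

    lstm_layer = tf.keras.layers.LSTM(64)   # three gates plus a separate internal cell state
    gru_layer = tf.keras.layers.GRU(64)     # fewer gates, no separate cell state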
SLIDE 30
Visualising LSTM activations
Sometimes, the LSTM cell state corresponds with features of the sequential data: Source: (Karpathy, 2015)
SLIDE 31
CharRNN Applications: FolkRNN
Some kinds of music can be represented in a text-like manner. Source: Sturm et al. 2015. Folk Music Style Modelling by Recurrent Neural Networks with Long Short Term Memory Units
SLIDE 32
Other CharRNN Applications
Teaching Recurrent Neural Networks about Monet
SLIDE 33
Google Magenta Performance RNN
◮ State-of-the-art in music-generating RNNs.
◮ Encodes MIDI musical sequences as categorical data.
◮ Now supports polyphony (multiple notes), dynamics (volume), and expressive timing.
SLIDE 34
Neural iPad Band, another CharRNN
◮ iPad music transcribed as a sequence of numbers for each performer.
◮ Trick: encode multiple ints as one (preserving ordering); see the sketch below.
◮ Video
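One way to implement such a trick, assuming each performer’s value is an integer in a known range 0..N−1, is a mixed-radix (positional) encoding; this is an illustrative reconstruction, not necessarily the exact scheme used:

    N = 128   # assumed range of each performer's value

    def encode_ints(values):
        # pack several small ints into one, preserving their order
        code = 0
        for v in values:
            code = code * N + v
        return code

    def decode_ints(code, n_values):
        values = []
        for _ in range(n_values):
            code, v = divmod(code, N)
            values.append(v)
        return values[::-1]

    assert decode_ints(encode_ints([3, 60, 127]), 3) == [3, 60, 127]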
SLIDE 35
Books and Learning References
◮ Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
◮ François Chollet. 2018. Deep Learning with Python. Manning.
◮ Chris Olah. 2015. Understanding LSTMs
◮ RNNs in Tensorflow
◮ Maybe RNN/LSTM is dead? CNNs can work similarly to BLSTMs
◮ Karpathy. 2015. The Unreasonable Effectiveness of RNNs
SLIDE 36