

SLIDE 1

Deep Dive on RNNs

Charles Martin

SLIDE 2

What is an Artificial Neurone?

Source - Wikimedia Commons

SLIDE 3

Feed-Forward Network

For each unit: y = tanh(W x + b)
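As a rough NumPy sketch of a single feed-forward unit (sizes are purely illustrative):

```python
import numpy as np

# One feed-forward unit: y = tanh(W x + b).
# Illustrative sizes: 4 inputs, 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # weight matrix
b = rng.standard_normal(3)        # bias vector
x = rng.standard_normal(4)        # input vector

y = np.tanh(W @ x + b)            # unit output, shape (3,)
```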

SLIDE 4

Recurrent Network

For each unit: y_t = tanh(U x_t + V h_{t-1} + b)
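A corresponding NumPy sketch of a single recurrent unit (sizes again illustrative):

```python
import numpy as np

# One recurrent unit: y_t = tanh(U x_t + V h_{t-1} + b).
# The new ingredient is V, which feeds the previous hidden state back in.
rng = np.random.default_rng(1)
U = rng.standard_normal((3, 4))   # input-to-hidden weights
V = rng.standard_normal((3, 3))   # hidden-to-hidden (recurrent) weights
b = rng.standard_normal(3)

h_prev = np.zeros(3)              # previous hidden state h_{t-1}
x_t = rng.standard_normal(4)      # current input x_t

y_t = np.tanh(U @ x_t + V @ h_prev + b)
```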

SLIDE 5

Sequence Learning Tasks

SLIDE 6

Recurrent Network

simplifying...
SLIDE 7

Recurrent Network

simplifying and rotating...

SLIDE 8

“State” in Recurrent Networks

◮ Recurrent Networks are all about storing a “state” in between computations.

◮ A “lossy summary of ... past sequences”.

◮ h is the “hidden state” of our RNN.

◮ What influences h?

SLIDE 9

Defining the RNN State

We can define a simplified RNN, represented by this diagram, as follows:

h_t = tanh(U x_t + V h_{t-1} + b)

ŷ_t = softmax(c + W h_t)
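A minimal NumPy sketch of one step of this simplified RNN (dimensions and parameter shapes are illustrative, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def rnn_step(x_t, h_prev, U, V, W, b, c):
    """One step of the simplified RNN: new hidden state and output distribution."""
    h_t = np.tanh(U @ x_t + V @ h_prev + b)    # h_t = tanh(U x_t + V h_{t-1} + b)
    y_hat = softmax(c + W @ h_t)               # y-hat_t = softmax(c + W h_t)
    return h_t, y_hat
```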

SLIDE 10

Unfolding an RNN in Time

Figure 1: Unfolding an RNN in Time

◮ By unfolding the RNN we can compute ŷ for a given length of sequence.

◮ Note that the weight matrices U, V, W are the same for each timestep; this is the big advantage of RNNs!

SLIDE 11

Forward Propagation

We can now compute ŷ_3 by computing h for each of the previous steps, using the following equations:

h_t = tanh(U x_t + V h_{t-1} + b)

ŷ_t = softmax(c + W h_t)
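A sketch of that forward pass in NumPy; note that the same U, V, W are applied at every step (all sizes illustrative):

```python
import numpy as np

# Forward propagation for three timesteps. Illustrative sizes: 4-dim inputs,
# 3-dim hidden state, output distribution over 5 classes.
rng = np.random.default_rng(2)
U = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 3))
W = rng.standard_normal((5, 3))
b = rng.standard_normal(3)
c = rng.standard_normal(5)

xs = rng.standard_normal((3, 4))        # x_1, x_2, x_3
h = np.zeros(3)                         # h_0
for x_t in xs:                          # the same U, V, W at every step
    h = np.tanh(U @ x_t + V @ h + b)    # update the hidden state
    z = c + W @ h
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()                # softmax -> output distribution
# y_hat now holds the prediction for step 3 (y-hat_3)
```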

SLIDE 12

Y-hat is Softmax’d

ŷ is a probability distribution! A finite number of weights that add to 1:

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k},  for j = 1, ..., K
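A small NumPy version of this softmax (subtracting the max is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), for j = 1..K."""
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is a valid probability distribution: every entry in (0, 1) and p.sum() == 1.0
```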

SLIDE 13

Calculating Loss: Categorical Cross Entropy

We use the categorical cross-entropy function for loss:

h_t = tanh(b + V h_{t-1} + U x_t)

ŷ_t = softmax(c + W h_t)

L_t = −y_t · log(ŷ_t)

Loss = Σ_t L_t
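A NumPy sketch of the per-step loss and the total loss (the example targets and predictions are made up for illustration):

```python
import numpy as np

def cross_entropy(y_true, y_hat, eps=1e-12):
    """L_t = -y_t . log(y-hat_t); y_true is one-hot, y_hat a distribution."""
    return -np.sum(y_true * np.log(y_hat + eps))

# Total loss is the sum of the per-step losses.
y_true_seq = np.array([[0, 1, 0], [1, 0, 0]])               # one-hot targets
y_hat_seq = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])    # model outputs
loss = sum(cross_entropy(y, p) for y, p in zip(y_true_seq, y_hat_seq))
```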

SLIDE 14

Backpropagation Through Time (BPTT)

Propagates error correction backwards through the network graph, adjusting all parameters (U, V, W) to minimise loss.

SLIDE 15

Example: Character-level text model

◮ Training data: a collection of text.

◮ Input (X): snippets of 30 characters from the collection.

◮ Target output (y): 1 character, the next one after the 30 in each X.
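One plausible way to build such a dataset in Python (the 30-character window follows the slide; names are illustrative):

```python
def make_char_dataset(text, window=30):
    """Slice text into (30-character snippet, next character) pairs."""
    xs, ys = [], []
    for i in range(len(text) - window):
        xs.append(text[i:i + window])   # X: 30 characters
        ys.append(text[i + window])     # y: the character that follows
    return xs, ys

xs, ys = make_char_dataset("My cat is named Simon. He is a good cat.")
```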

SLIDE 16

Training the Character-level Model

◮ Target: a probability distribution with P(n) = 1.

◮ Output: a probability distribution over all next letters.

◮ E.g.: “My cat is named Simon” would lead to X: “My cat is named Simo” and y: “n”.

SLIDE 17

Using the trained model to generate text

◮ S: sampling function; sample a letter using the output probability distribution.

◮ The generated letter is reinserted as the next input.

◮ We don’t want to always draw the most likely character. That would give frequent repetition and “copying” from the training text. We need a sampling strategy.
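A sketch of one common sampling strategy, temperature sampling (this is one possible strategy, not necessarily the one used in the original slides):

```python
import numpy as np

def sample(probs, temperature=1.0):
    """Draw one index from the output distribution.

    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more surprising choices).
    """
    logits = np.log(probs + 1e-12) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```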

SLIDE 18

Char-RNN

◮ RNN as a sequence generator.

◮ Input is the current symbol; output is the next predicted symbol.

◮ Connect output to input and continue!

◮ CharRNN simply applies this to a subset of ASCII characters.

◮ Train and generate on any text corpus: Fun! See: Karpathy, A. (2015). The Unreasonable Effectiveness of Recurrent Neural Networks.
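A sketch of the "connect output to input" generation loop; predict_next and index_to_char are hypothetical placeholders (a function returning the model's output distribution, and an index-to-character mapping), and sample is the helper from the earlier sketch:

```python
def generate(predict_next, index_to_char, seed, length=200, temperature=0.8):
    """Feed each sampled character back in as the next input."""
    text = seed
    for _ in range(length):
        probs = predict_next(text)          # hypothetical: model's distribution over next chars
        idx = sample(probs, temperature)    # sample (don't just take the argmax)
        text += index_to_char[idx]          # reinsert as the next input
    return text
```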

SLIDE 19

Char-RNN Examples

Shakespeare (Karpathy, 2015):

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

LaTeX Algebraic Geometry: N.B. “Proof. Omitted.” Lol.

SLIDE 20

RNN Architectures and LSTM

SLIDE 21

Bidirectional RNNs

◮ Useful for tasks where the whole sequence is available.

◮ Each output unit (ŷ) depends on both past and future inputs, but is most sensitive to closer times.

◮ Popular in speech recognition, translation, etc.
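In Keras, a bidirectional recurrent layer can be written roughly like this (layer and vocabulary sizes are illustrative):

```python
import tensorflow as tf

# Illustrative sizes: 30-step sequences over a 64-symbol vocabulary.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 64)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),  # forward + backward pass
    tf.keras.layers.Dense(64, activation="softmax"),
])
```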

SLIDE 22

Encoder-Decoder (seq-to-seq)

◮ Learns to generate an output sequence (y) from an input sequence (x).

◮ The final hidden state of the encoder is used to compute a context variable C.

◮ For example, translation.
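A rough Keras sketch of the encoder-decoder idea, with the encoder's final states used to initialise the decoder (vocabulary and layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab, units = 1000, 256                     # illustrative sizes

# Encoder: its final hidden state provides the context for the decoder.
enc_in = layers.Input(shape=(None,))
enc_x = layers.Embedding(vocab, 64)(enc_in)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_x)

# Decoder: initialised from the encoder's final state (the context C).
dec_in = layers.Input(shape=(None,))
dec_x = layers.Embedding(vocab, 64)(dec_in)
dec_x, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_x, initial_state=[state_h, state_c])
dec_out = layers.Dense(vocab, activation="softmax")(dec_x)

model = tf.keras.Model([enc_in, dec_in], dec_out)
```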

SLIDE 23

Deep RNNs

◮ Does adding deeper layers to an RNN make it work better?

◮ Several options for architecture.

◮ Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013).

◮ Intuitively: layers might learn some hierarchical knowledge automatically.

◮ Typical setup: up to three recurrent layers.
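A sketch of the stacked setup in Keras; every recurrent layer except the last returns full sequences so the next layer receives one vector per timestep (sizes illustrative):

```python
import tensorflow as tf

# Three stacked recurrent layers over 30-step sequences of 64-dim inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 64)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(64, activation="softmax"),
])
```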

SLIDE 24

Long-Term Dependencies

◮ Learning long dependencies is a mathematical challenge.

◮ Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely).

◮ E.g., consider a simplified RNN with no nonlinear activation function or input.

◮ Each time step multiplies h(0) by W.

◮ This corresponds to raising the eigenvalues in Λ to the power t.

◮ Eventually, components of h(0) not aligned with the largest eigenvector will be discarded.

h_t = W h_{t-1}

h_t = W^t h_0

Supposing W admits an eigendecomposition with orthogonal matrix Q:

W = Q Λ Q^⊤

h_t = Q Λ^t Q^⊤ h_0
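A small NumPy experiment illustrating the argument (the eigenvalues 0.5 and 1.1 are chosen only to show one vanishing and one growing component):

```python
import numpy as np

# Build W with an orthogonal Q and eigenvalues 0.5 and 1.1, then apply it
# repeatedly: the 0.5-component of h_0 vanishes, the 1.1-component grows.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # orthogonal matrix
W = Q @ np.diag([0.5, 1.1]) @ Q.T

h0 = np.ones(2)
for t in (1, 10, 50):
    print(t, np.linalg.matrix_power(W, t) @ h0)    # h_t = W^t h_0
```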

SLIDE 25

Vanishing and Exploding Gradients

◮ “in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish”

◮ “whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.”

◮ Note that this problem is only relevant for recurrent networks, since the weights W affecting the hidden state are the same at each time step.

◮ Goodfellow and Bengio (2016): “the problem of learning long-term dependencies remains one of the main challenges in deep learning”

◮ WildML (2015). Backpropagation Through Time and Vanishing Gradients

◮ ML for artists

SLIDE 26

Gated RNNs

◮ Possible solution!

◮ Provide a gate that can change the hidden state a little bit at each step.

◮ The gates are controlled by learnable weights as well!

◮ Hidden state weights that may change at each time step.

◮ Create paths through time with derivatives that do not vanish/explode.

◮ Gates choose information to accumulate or forget at each time step.

◮ Most effective sequence models used in practice!

SLIDE 27

Long Short-Term Memory

◮ Self-loop containing an internal state (c).

◮ Three extra gating units:

◮ Forget gate: controls how much memory is preserved.

◮ Input gate: controls how much of the current input is stored.

◮ Output gate: controls how much of the state is shown to the output.

◮ Each gate has its own weights and biases, so this uses many more parameters.

◮ Some variants on this design exist, e.g., using c as additional input to the three gate units.

SLIDE 28

Long Short-Term Memory

◮ Forget gate: f

◮ Internal state: s

◮ Input gate: g

◮ Output gate: q

◮ Output: h
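A sketch of one LSTM step using the slide's gate names; this follows the standard LSTM equations, with the parameter names (Wf, Uf, bf, ...) invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step; p is a dict of (invented) per-gate weights and biases."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])       # forget gate f
    g = sigmoid(p["Wg"] @ x + p["Ug"] @ h_prev + p["bg"])       # input gate g
    q = sigmoid(p["Wq"] @ x + p["Uq"] @ h_prev + p["bq"])       # output gate q
    s_cand = np.tanh(p["Ws"] @ x + p["Us"] @ h_prev + p["bs"])  # candidate state
    s = f * s_prev + g * s_cand                                 # internal state s
    h = q * np.tanh(s)                                          # output h
    return h, s
```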

SLIDE 29

Other Gating Units

Source: (Olah, C. 2015.)

◮ Are three gates necessary?

◮ Other gating units are simpler, e.g., the Gated Recurrent Unit (GRU).

◮ For the moment, LSTMs are winning in practical use.

◮ Maybe someone wants to explore alternatives in a project?

SLIDE 30

Visualising LSTM activations

Sometimes, the LSTM cell state corresponds with features of the sequential data: Source: (Karpathy, 2015)

SLIDE 31

CharRNN Applications: FolkRNN

Some kinds of music can be represented in a text-like manner. Source: Sturm et al. 2015. Folk Music Style Modelling by Recurrent Neural Networks with Long Short Term Memory Units

SLIDE 32

Other CharRNN Applications

Teaching Recurrent Neural Networks about Monet

SLIDE 33

Google Magenta Performance RNN

◮ State-of-the-art in music-generating RNNs.

◮ Encode MIDI musical sequences as categorical data.

◮ Now supports polyphony (multiple notes), dynamics (volume), and expressive timing.

SLIDE 34

Neural iPad Band, another CharRNN

◮ iPad music transcribed as a sequence of numbers for each performer.

◮ Trick: encode multiple ints as one (preserving ordering); see the sketch below.

◮ Video
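One plausible way to pack several small ints into a single symbol while preserving order (purely illustrative; not necessarily the exact encoding used in this work):

```python
def pack(values, base=128):
    """Combine several ints (each 0 <= v < base) into one int, order preserved."""
    code = 0
    for v in values:
        code = code * base + v
    return code

def unpack(code, n, base=128):
    values = []
    for _ in range(n):
        code, v = divmod(code, base)
        values.append(v)
    return values[::-1]

assert unpack(pack([3, 12, 7]), 3) == [3, 12, 7]
```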

SLIDE 35

Books and Learning References

◮ Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

◮ François Chollet. 2018. Deep Learning with Python. Manning.

◮ Chris Olah. 2015. Understanding LSTMs.

◮ RNNs in TensorFlow.

◮ Maybe RNN/LSTM is dead? CNNs can work similarly to BLSTMs.

◮ Karpathy, A. 2015. The Unreasonable Effectiveness of Recurrent Neural Networks.

SLIDE 36

Summary

◮ Recurrent Neural Networks let us capture and model the structure of sequential data.

◮ Sampling from trained RNNs allows us to generate new, creative sequences.

◮ The internal state of RNNs makes them interesting for interactive applications, since it lets them capture and continue from the current context or “style”.

◮ LSTM units are able to overcome the vanishing gradient problem to some extent.