Machine Learning for Computational Linguistics: Recurrent Neural Networks - PowerPoint PPT Presentation



slide-1
SLIDE 1

Machine Learning for Computational Linguistics

Recurrent neural networks (RNNs)

Çağrı Çöltekin

University of Tübingen Seminar für Sprachwissenschaft

July 5, 2016

slide-2
SLIDE 2

Neural networks: a quick summary Recurrent neural networks

Feed-forward networks

[Figure: a two-layer feed-forward network with inputs x1, x2, hidden units h1, h2 (activation f()), outputs y1, y2 (activation g()), bias units, and weights W(1) (input to hidden) and W(2) (hidden to output).]

h = f(W(1) x)
y = g(W(2) h) = g(W(2) f(W(1) x))

  • f() and g() are non-linear functions, such as the logistic sigmoid, tanh, or ReLU
  • weights are updated using backpropagation
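A minimal sketch of these two equations in plain NumPy (the layer sizes, weights, and input below are made up for illustration, and bias terms are left out):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy weights for a 2-2-2 network (no bias terms, for brevity)
W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])   # W(1): input -> hidden
W2 = np.array([[1.0,  0.7],
               [-0.4, 0.9]])   # W(2): hidden -> output
x = np.array([1.0, 2.0])       # input vector

h = np.tanh(W1 @ x)            # h = f(W(1) x), with f = tanh
y = sigmoid(W2 @ h)            # y = g(W(2) h), with g = logistic sigmoid
print(h, y)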

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 1 / 22

slide-3
SLIDE 3

Neural networks: a quick summary Recurrent neural networks

Dense (word) representations

  • Dense vector representations are useful for many ML methods, particularly for neural networks
  • Unlike sparse (one-of-K / one-hot) representations, dense representations capture similarities/differences between words, as well as relations between them
  • General-purpose word vectors can be trained with unlabeled data
  • They can also be trained for the task at hand
  • Two methods to obtain (general-purpose) dense representations:
    – global statistics over the complete data (e.g., SVD)
    – predicting the local environment (word2vec, GloVe)

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 2 / 22

slide-4
SLIDE 4

Neural networks: a quick summary Recurrent neural networks

Deep feed-forward networks

[Figure: a deep feed-forward network with inputs x1 ... xm and several hidden layers.]

  • Deep neural networks (>2 hidden layers) have recently been successful in many tasks
  • They are particularly useful in problems where layers/hierarchies of features are useful
  • Training deep networks with backpropagation may result in vanishing or exploding gradients

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 3 / 22

slide-5
SLIDE 5

Neural networks: a quick summary Recurrent neural networks

Convolutional networks

[Figure: a 1D convolution over inputs x1 ... x5 producing feature maps h1 ... h5, followed by a pooling layer that reduces them to a smaller set of features.]

  • Convolution transforms the input by replacing each input unit with a weighted sum of its neighbors (sketched in code below)
  • Typically it is followed by pooling
  • CNNs are useful for detecting local features with some amount of location invariance
  • Sparse connectivity makes CNNs computationally efficient
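A small sketch of convolution and max-pooling over a sequence (plain NumPy; the input sequence and single filter below are made up for illustration, whereas real CNN layers learn many such filters):

import numpy as np

def conv1d(x, w):
    """Slide a window of len(w) over x; each output is a weighted sum of neighbors."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def max_pool(h, size=2):
    """Keep the maximum of each non-overlapping window (downsampling)."""
    return np.array([h[i:i + size].max() for i in range(0, len(h) - size + 1, size)])

x = np.array([1.0, 3.0, -2.0, 0.5, 4.0, -1.0])  # toy input sequence
w = np.array([0.2, 0.5, 0.3])                    # toy convolution filter
h = conv1d(x, w)        # feature map
p = max_pool(h)         # pooled features
print(h, p)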

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 4 / 22

slide-6
SLIDE 6

Neural networks: a quick summary Recurrent neural networks

CNNs for NLP

[Figure: a CNN pipeline for text over the example input "not really worth seeing": input, word vectors, convolution, feature maps, pooling, features, classifier.]

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 5 / 22

slide-7
SLIDE 7

Neural networks: a quick summary Recurrent neural networks

Recurrent neural networks: motivation

  • Feed-forward networks
    – can only learn associations
    – have no memory of earlier inputs: they cannot handle sequences
  • Recurrent neural networks are the neural-network solution for sequence learning
  • This is achieved by recurrent loops (feedback connections) in the network

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 6 / 22

slide-10
SLIDE 10

Neural networks: a quick summary Recurrent neural networks

Recurrent neural networks

[Figure: a recurrent network reading inputs x1 ... x4 through hidden units h1 ... h4, with connections between successive hidden units, producing output y.]

  • Recurrent neural networks are similar to the standard feed-forward networks
  • But they include loops that use the previous output (of the hidden layers) as well as the input
  • Forward calculation is straightforward; learning becomes somewhat tricky

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 7 / 22

slide-11
SLIDE 11

Neural networks: a quick summary Recurrent neural networks

A simple version: SRNs

Elman (1990)

[Figure: Elman's simple recurrent network: input units and context units (a copy of the previous hidden layer) feed into the hidden units, which feed the output units.]

  • The network keeps the previous hidden states (context units)
  • The rest is just like a feed-forward network (see the sketch after this list)
  • Training is simple, but the network cannot learn long-distance dependencies
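A minimal sketch of one Elman-style update (plain NumPy; the layer sizes and randomly initialized weights are made up for illustration, and training is omitted):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

context = np.zeros(n_hidden)                 # context units start empty
sequence = [rng.normal(size=n_in) for _ in range(5)]

for x in sequence:
    hidden = np.tanh(W_xh @ x + W_ch @ context)  # hidden sees input + previous state
    y = W_hy @ hidden                            # output layer (linear here)
    context = hidden                             # copy hidden state into context units
print(y)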

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 8 / 22

slide-12
SLIDE 12

Neural networks: a quick summary Recurrent neural networks

Processing sequences with RNNs

  • RNNs process sequences one unit at a time
  • The earlier input affects the output through the recurrent links

[Figure: the network reads the example sentence "not really worth seeing" one word at a time; each word passes through the hidden states h1 ... h4 on the way to the output y.]

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 9 / 22

slide-17
SLIDE 17

Neural networks: a quick summary Recurrent neural networks

Learning in recurrent networks

[Figure: a recurrent unit with input x, hidden state h(1), and output y(1), showing the three weight sets: input-to-hidden, hidden-to-hidden (the recurrent loop), and hidden-to-output.]

  • We need to learn three sets of weights: the input-to-hidden, hidden-to-hidden (recurrent), and hidden-to-output weights
  • Backpropagation in RNNs is at first not that obvious: it is not immediately clear how errors should be propagated back through the recurrent links

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 10 / 22

slide-18
SLIDE 18

Neural networks: a quick summary Recurrent neural networks

Unrolling a recurrent network

Back propagation through time (BPTT)

[Figure: the network unrolled in time, with inputs x(0), x(1), ..., x(t−1), x(t), hidden states h(0), ..., h(t), and outputs y(0), ..., y(t). Note: the weights at each time step are shared (drawn in the same color in the original figure).]

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 11 / 22

slide-19
SLIDE 19

Neural networks: a quick summary Recurrent neural networks

RNN architectures

Many-to-many (e.g., POS tagging)

[Figure: an unrolled RNN with inputs x(0) ... x(t), hidden states h(0) ... h(t), and one output y(0) ... y(t) per input.]

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 12 / 22

slide-20
SLIDE 20

Neural networks: a quick summary Recurrent neural networks

RNN architectures

Many-to-one (e.g., document classifjcation)

[Figure: an unrolled RNN with inputs x(0) ... x(t) and hidden states h(0) ... h(t), producing a single output y(t) after the whole sequence has been read.]

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 12 / 22

slide-21
SLIDE 21

Neural networks: a quick summary Recurrent neural networks

RNN architectures

Many-to-many with a delay (e.g., machine translation)

[Figure: an unrolled RNN that reads inputs x(0) ... x(t) but starts emitting outputs (y(t−1), y(t), ...) only after a delay.]

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 12 / 22

slide-22
SLIDE 22

Neural networks: a quick summary Recurrent neural networks

Bidirectional RNNs

[Figure: a bidirectional RNN: forward states process the input x(t−1), x(t), x(t+1) left to right, backward states process it right to left, and each output y(t) depends on both.]
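A small sketch of the idea (plain NumPy; the simple tanh RNN, the sizes, and the random weights are made up for illustration, and the two directions use separate weights):

import numpy as np

def rnn_states(xs, W_x, W_h):
    """Return the hidden state after each input (simple tanh RNN)."""
    h, states = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
xs = [rng.normal(size=d_in) for _ in range(6)]          # toy input sequence

# separate weights for the two directions
Wx_f, Wh_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wx_b, Wh_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

forward = rnn_states(xs, Wx_f, Wh_f)                    # left-to-right states
backward = rnn_states(xs[::-1], Wx_b, Wh_b)[::-1]       # right-to-left states, re-aligned
# at each position the output can use context from both directions
bidir = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(bidir[0].shape)   # (8,) = 2 * d_h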

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 13 / 22

slide-23
SLIDE 23

Neural networks: a quick summary Recurrent neural networks

A short divergence: language models

  • Language models are useful in many NLP tasks
  • A language model defines a probability distribution over sequences of words
  • An n-gram model approximates the probability of a sequence of words as

    P(w1, ..., wm) ≈ ∏_{i=1}^{m} P(wi | wi−1, ..., wi−(n−1))

  • Conditional probabilities are estimated from an (unlabeled) corpus (a toy example follows below)
  • Larger n-grams require lots of memory, and their probabilities cannot be estimated reliably due to sparsity
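As a toy illustration of estimating these conditional probabilities from a corpus: a maximum-likelihood bigram model over a made-up three-sentence corpus (no smoothing, so unseen bigrams get probability zero, which is the sparsity problem mentioned above):

from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# count bigrams and their one-word histories over the corpus
bigrams, histories = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        bigrams[(prev, cur)] += 1
        histories[prev] += 1

def p(cur, prev):
    """Maximum-likelihood estimate of P(cur | prev); zero for unseen bigrams."""
    return bigrams[(prev, cur)] / histories[prev] if histories[prev] else 0.0

print(p("cat", "the"))   # 2/6: "the cat" occurs twice among 6 occurrences of "the"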

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 14 / 22

slide-24
SLIDE 24

Neural networks: a quick summary Recurrent neural networks

RNNs as language models

  • RNNs can function as language models
  • We can train RNNs for this purpose using unlabeled data
  • During training, the task of the RNN is to predict the next word
  • Depending on the network configuration, an RNN can learn dependencies at a longer distance
  • The resulting system can generate sequences (see the sketch below)
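A sketch of such a generation loop (plain NumPy over a tiny hypothetical vocabulary; the Elman-style weights here are random and untrained, so the output is gibberish, but the feedback loop is the same one a trained model would use):

import numpy as np

rng = np.random.default_rng(2)
vocab = ["<s>", "not", "really", "worth", "seeing", "</s>"]
V, H = len(vocab), 8

# untrained weights, for illustration only
E = rng.normal(scale=0.1, size=(H, V))    # one-hot word -> hidden-size input
W_h = rng.normal(scale=0.1, size=(H, H))  # recurrent weights
W_o = rng.normal(scale=0.1, size=(V, H))  # hidden -> next-word scores

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

h = np.zeros(H)
word = "<s>"
generated = []
for _ in range(10):                        # generate at most 10 words
    x = np.eye(V)[vocab.index(word)]       # one-hot encoding of the previous word
    h = np.tanh(E @ x + W_h @ h)           # update the hidden state
    probs = softmax(W_o @ h)               # distribution over the next word
    word = rng.choice(vocab, p=probs)      # sample the next word
    if word == "</s>":
        break
    generated.append(word)
print(" ".join(generated))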

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 15 / 22

slide-25
SLIDE 25

Neural networks: a quick summary Recurrent neural networks

RNNs as language models: a fun example

RNNs trained on character sequences generating Shakespeare:

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

Clown:
Come, sir, I will make did behold your worship.

From Andrej Karpathy’s blog at http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 16 / 22

slide-26
SLIDE 26

Neural networks: a quick summary Recurrent neural networks

Unstable gradients revisited

  • We noted earlier that gradients may vanish or explode during backpropagation in deep networks
  • This is especially problematic for RNNs, since the effective depth of the network can be extremely large
  • Although RNNs can in theory learn long-term dependencies, long-distance dependencies become a problem in practice
  • The most popular solution is to use gated recurrent networks

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 17 / 22

slide-27
SLIDE 27

Neural networks: a quick summary Recurrent neural networks

LSTMs

  • Long short-term memory models (LSTMs) are the most popular ‘gated’ RNNs
  • They have been shown to perform well in many sequence learning problems
  • In essence they are similar to simple RNNs, but they can learn longer-distance dependencies
  • Although LSTMs date back to the 1990s, they are very popular at present, and many variants have been developed recently

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 18 / 22

slide-28
SLIDE 28

Neural networks: a quick summary Recurrent neural networks

LSTMs: the picture

[Figure: the LSTM cell diagram from Christopher Olah’s blog.]

Source: Christopher Olah’s blog http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 19 / 22

slide-29
SLIDE 29

Neural networks: a quick summary Recurrent neural networks

LSTMs: the internal ‘cell’ memory

  • C is a (long-term) internal memory
  • C is calculated at each time step using relatively straightforward calculations
  • C can hold information for long stretches of time

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 20 / 22

slide-33
SLIDE 33

Neural networks: a quick summary Recurrent neural networks

LSTMs: step-by-step explanation

  • ft = σ(Wf [ht−1, xt])  (forget gate)
  • it = σ(Wi [ht−1, xt])  (input gate)
  • C̃t = tanh(WC [ht−1, xt])  (candidate memory)
  • Ct = ft ⊗ Ct−1 + it ⊗ C̃t  (updated cell memory)
  • ot = σ(Wo [ht−1, xt])  (output gate)
  • ht = ot ⊗ tanh(Ct)  (new hidden state / output)
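A minimal sketch of one LSTM step following these equations (plain NumPy; the sizes and randomly initialized weights are made up for illustration, and bias terms are omitted, matching the equations above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n_in, n_hidden = 3, 4
concat = n_hidden + n_in                       # size of [h_{t-1}, x_t]

# one weight matrix per gate / candidate (biases omitted, as on the slide)
W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=(n_hidden, concat)) for _ in range(4))

h_prev, C_prev = np.zeros(n_hidden), np.zeros(n_hidden)
x_t = rng.normal(size=n_in)

hx = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ hx)                        # forget gate
i_t = sigmoid(W_i @ hx)                        # input gate
C_tilde = np.tanh(W_C @ hx)                    # candidate memory
C_t = f_t * C_prev + i_t * C_tilde             # elementwise (⊗) memory update
o_t = sigmoid(W_o @ hx)                        # output gate
h_t = o_t * np.tanh(C_t)                       # new hidden state
print(h_t)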

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 21 / 22

slide-34
SLIDE 34

Neural networks: a quick summary Recurrent neural networks

Summary, concluding remarks

  • Recurrent neural networks include recurrent (feedback) links
  • They keep a state, allowing the network to learn sequences
  • RNNs are Turing-complete: they can approximate any dynamical system
  • Simple recurrent networks suffer from unstable gradients, and cannot learn long-term dependencies
  • LSTMs and other gated RNNs solve the long-term dependency problem (to some extent)

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 22 / 22

slide-35
SLIDE 35

Credits and References

  • The LSTM diagrams are from Christopher Olah’s blog at http://colah.github.io/posts/2015-08-Understanding-LSTMs/, which is also a very good introduction to LSTMs
  • Also see Andrej Karpathy’s blog at http://karpathy.github.io/2015/05/21/rnn-effectiveness/, where the Shakespeare example comes from; it is a very nicely written introduction to RNNs

Ç. Çöltekin, SfS / University of Tübingen July 5, 2016 A.1