Recurrent Neural Networks for Language Modeling


SLIDE 1

Recurrent Neural Networks for Language Modeling

CSE392 - Spring 2019 Special Topic in CS

SLIDE 2

Tasks

  • Language Modeling: generate the next word or sentence ≈ capture a hidden representation of sentences.
  • Recurrent Neural Networks and Sequence Models: how?

SLIDE 3

Language Modeling

Task: Estimate P(wn | w1, w2, …, wn−1): the probability of the next word given its history.

P(fork | He ate the cake with the) = ?

SLIDE 4

Language Modeling

Task: Estimate P(wn | w1, w2, …, wn−1): the probability of the next word given its history.
P(fork | He ate the cake with the) = ?

[Diagram: a training corpus is used to train (fit, learn) a language model; given the history (He, ate, the, cake, with, the), the trained model answers "What is the next word in the sequence?" over candidates: icing, the, fork, carrots, cheese, spoon.]
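Once trained, querying the model amounts to asking for a probability distribution over the candidate next words. A toy sketch in Python (the probabilities are invented for illustration):

# toy next-word distribution a trained model might return (invented numbers)
p_next = {"icing": 0.30, "the": 0.05, "fork": 0.40,
          "carrots": 0.08, "cheese": 0.07, "spoon": 0.10}
best = max(p_next, key=p_next.get)   # -> "fork"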

SLIDE 5

Neural Networks: Graphs of Operations (excluding the optimization nodes)

(Jurafsky, 2019)

"Hidden layer" (g and f are activation functions):

h(t) = g(h(t−1) U + x(t) V)
y(t) = f(h(t) W)
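A minimal NumPy sketch of one step of these two equations (the tanh/softmax choices for g and f and the toy dimensions are illustrative assumptions):

import numpy as np

def rnn_step(h_prev, x_t, U, V, W):
    # h(t) = g(h(t-1) U + x(t) V), with g = tanh
    h_t = np.tanh(h_prev @ U + x_t @ V)
    # y(t) = f(h(t) W), with f = softmax (numerically stabilized)
    scores = h_t @ W
    e = np.exp(scores - scores.max())
    return h_t, e / e.sum()

# toy sizes: hidden = 4, input embedding = 3, vocabulary = 5
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(4, 4)), rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
h, y = rnn_step(np.zeros(4), rng.normal(size=3), U, V, W)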

SLIDE 6

Language Modeling

Task: Estimate P(wn | w1, w2, …, wn−1). P(fork | He ate the cake with the) = ?

[Diagram repeated from Slide 4: the full history (He, ate, the, cake, with, the) feeds the trained language model, which scores candidates: icing, the, fork, carrots, cheese, spoon.]

SLIDE 7

Language Modeling

Task: Estimate P(wn | w1, w2, …, wn−1). P(fork | He ate the cake with the) = ?

[Diagram: rather than the full history (He, ate, the, cake, with, the), only the last word is highlighted as the input to the trained language model; candidates: icing, the, fork, carrots, cheese, spoon. The model is trained (fit, learned) on a training corpus.]

SLIDE 8

Language Modeling

Task: Estimate P(wn | w1, w2, …, wn−1). P(fork | He ate the cake with the) = ?

[Diagram: only the last word (the) is fed into the trained language model at this step; candidates: icing, the, fork, carrots, cheese, spoon.]

h(t): a vector that we hope "stores" relevant history from the previous inputs: He, ate, the, cake, with.

SLIDE 9

Optimization: Backward Propagation

# define forward pass graph:
h(0) = 0
for i in range(1, len(x)):
    h(i) = tf.tanh(tf.matmul(U, h(i-1)) + tf.matmul(W, x(i)))  # update hidden state
    y(i) = tf.softmax(tf.matmul(V, h(i)))                      # update output
...
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))

SLIDE 10

Optimization: Backward Propagation

# define forward pass graph:
h(0) = 0
for i in range(1, len(x)):
    h(i) = tf.tanh(tf.matmul(U, h(i-1)) + tf.matmul(W, x(i)))  # update hidden state
    y(i) = tf.softmax(tf.matmul(V, h(i)))                      # update output
...
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))

To find the gradient for the overall graph, we use backpropagation, which essentially chains together the gradients of each node (function) in the graph. With many recursions, the gradients can vanish or explode (become too small or too large to represent in floating point).
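A tiny numeric illustration of the effect, using a hypothetical scalar recurrence rather than the full model: backpropagating through t steps multiplies t per-step gradient factors together, so anything consistently below or above 1 shrinks or blows up exponentially.

# scalar stand-in: if each step contributes a factor w to the gradient,
# then t steps contribute w**t
for w in (0.9, 1.1):
    print(w, [w ** t for t in (1, 10, 100)])
# 0.9 -> 0.9, 0.35, 0.000027   (vanishes)
# 1.1 -> 1.1, 2.59, 13780.6    (explodes)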

SLIDE 11

Optimization: Backward Propagation

[Figure: gradients propagate backward from the cost through the unrolled graph (Geron, 2017).]

SLIDE 12

How to address exploding and vanishing gradients?

Ad hoc approaches: e.g., stop backprop iterations early (truncated backpropagation through time), or "clip" gradients when they grow too large.
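A minimal sketch of the clipping idea in NumPy (the threshold of 5.0 is an arbitrary illustration; TensorFlow provides the analogous tf.clip_by_global_norm):

import numpy as np

def clip_by_global_norm(grads, clip_norm=5.0):
    # rescale all gradients so their combined L2 norm is at most clip_norm
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        grads = [g * (clip_norm / global_norm) for g in grads]
    return grads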

SLIDE 13

How to address exploding and vanishing gradients?

Dominant approach: use Long Short-Term Memory (LSTM) networks.

[Figure: RNN model, "unrolled" depiction.]

(Geron, 2017)

SLIDE 14

How to address exploding and vanishing gradients?

The LSTM Cell

[Figure: the LSTM cell within the unrolled RNN.]

(Geron, 2017)

SLIDE 15

How to address exploding and vanishing gradients?

The LSTM Cell

[Figure: the LSTM cell within the unrolled RNN, labeling the "long term state" and the "short term state" (Geron, 2017).]

SLIDE 16

How to address exploding and vanishing gradients?

The LSTM Cell

[Figure repeated from Slide 15 (Geron, 2017).]

SLIDE 17

How to address exploding and vanishing gradients?

The LSTM Cell

[Figure: the LSTM cell, labeling the "long term state" and the "short term state".]

SLIDE 18

How to address exploding and vanishing gradients?

The LSTM Cell

[Figure: the LSTM cell, labeling the "long term state", the "short term state", and the bias term.]
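Written out, the cell is just a composition of simple functions. A minimal NumPy sketch of one LSTM step, following the standard formulation (the stacked-parameter layout is an implementation convenience, not something from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # pre-activations for the forget (f), input (i), output (o) gates
    # and the candidate (g), computed in one stacked multiply plus bias term
    z = x_t @ W + h_prev @ U + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates squashed into (0, 1)
    g = np.tanh(g)                                 # candidate values
    c_t = f * c_prev + i * g                       # "long term state"
    h_t = o * np.tanh(c_t)                         # "short term state"
    return h_t, c_t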

SLIDE 19

Common Activation Functions

z = h(t) W

Logistic: σ(z) = 1 / (1 + e^(−z))
Hyperbolic tangent: tanh(z) = 2σ(2z) − 1 = (e^(2z) − 1) / (e^(2z) + 1)
Rectified linear unit (ReLU): ReLU(z) = max(0, z)
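Each is a one-liner; a small NumPy sketch for reference:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 * logistic(2.0 * z) - 1.0   # identical to np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)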

SLIDE 20

LSTM

[Figure: the LSTM cell, with its "long term state" and "short term state" paths.]

SLIDE 21

LSTM

[Figure repeated from Slide 20: the LSTM cell.]

SLIDE 22

LSTM

[Figure repeated from Slide 20: the LSTM cell.]

SLIDE 23

Input to LSTM

?

SLIDE 24

Input to LSTM

?

  • One-hot encoding?
  • Word embedding (see the sketch below)
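A minimal sketch contrasting the two representations (the vocabulary, dimension d = 3, and values are invented for illustration):

import numpy as np

vocab = {"he": 0, "ate": 1, "the": 2, "cake": 3, "with": 4}

# one-hot: a sparse |V|-dimensional indicator vector
one_hot = np.zeros(len(vocab))
one_hot[vocab["cake"]] = 1.0

# embedding: a dense, learned d-dimensional vector (d << |V| in practice),
# looked up as a row of an embedding matrix E
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 3))   # |V| x d
embedding = E[vocab["cake"]]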
SLIDE 25

Input to LSTM

[Figure: a word represented as a dense vector of real values, e.g., (0.5, 3.5, 3.21, 1.3, 1.6).]

SLIDE 26

Input to LSTM

[Figure: each word in the sequence is mapped to its own embedding vector of real values.]
SLIDE 27

Input to LSTM

[Figure: the same sequence of embedding vectors; repeated words receive the same vector, marked "same".]

SLIDE 28

The GRU

Gated Recurrent Unit

(Geron, 2017)

SLIDE 29

The GRU

Gated Recurrent Unit

(Geron, 2017)

[Figure labels: relevance gate, update gate.]

SLIDE 30

The GRU

Gated Recurrent Unit

(Geron, 2017)

[Figure labels: relevance gate, update gate, and a candidate for updating h, sometimes written h̃.]
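A minimal NumPy sketch of one GRU step under one common convention (r = relevance/reset gate, z = update gate, h_tilde = the candidate h̃; the parameter names are assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(x_t @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur)              # relevance (reset) gate
    h_tilde = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)  # candidate for updating h
    return (1.0 - z) * h_prev + z * h_tilde          # blend old state and candidate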

SLIDE 31

The GRU

Gated Recurrent Unit

A long-distance dependency: The cake, which contained candles, was eaten.

SLIDE 32

What about the gradient?

The gates (i.e., multiplications based on a logistic) often end up keeping the hidden state exactly (or nearly exactly) as it was. Thus, for most dimensions of h, h(t) ≈ h(t−1).

The cake, which contained candles, was eaten.

SLIDE 33

What about the gradient?

The gates (i.e., multiplications based on a logistic) often end up keeping the hidden state exactly (or nearly exactly) as it was. Thus, for most dimensions of h, h(t) ≈ h(t−1). This tends to keep the gradient from vanishing, since the same values persist across multiple steps of backpropagation through time. (The same idea applies to LSTMs, but it is easier to see here.)

The cake, which contained candles, was eaten.
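A small numeric illustration (all values invented): where the update gate z is near 0, that dimension of h passes through unchanged, so its local gradient ∂h(t)/∂h(t−1) stays near 1 rather than shrinking.

import numpy as np

h_prev  = np.array([0.80, -0.30, 0.50])
h_tilde = np.array([0.10,  0.90, -0.70])  # candidate state
z       = np.array([0.02,  0.01,  0.95])  # update gate, per dimension

h_t = (1 - z) * h_prev + z * h_tilde      # ~= h_prev wherever z ~= 0
local_grad = 1 - z                        # gradient along the "copy" path
print(h_t)         # [ 0.786 -0.288 -0.64 ]
print(local_grad)  # [ 0.98  0.99  0.05 ]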

SLIDE 34

How to train an LSTM-style RNN

cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))

Cost Function:

  • "cross-entropy error"
SLIDE 35

How to train an LSTM-style RNN

cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))

Cost Function:

  • "cross-entropy error"
SLIDE 36

How to train an LSTM-style RNN

cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(y_pred)))

Cost Function:

  • "cross-entropy error"

Stochastic gradient descent: a method for minimizing the cost by updating the parameters with gradients computed on (mini-batches of) examples.
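A minimal sketch of the two pieces in NumPy (the epsilon and learning rate are illustrative assumptions; in TensorFlow the gradients come from automatic differentiation of the graph):

import numpy as np

def cross_entropy(y_true, y_pred):
    # the slide's cost: mean over examples of -sum(y * log(y_pred))
    return np.mean(-np.sum(y_true * np.log(y_pred + 1e-12), axis=-1))

def sgd_step(params, grads, lr=0.1):
    # one stochastic gradient descent update from a mini-batch's gradients
    return [p - lr * g for p, g in zip(params, grads)]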

SLIDE 37

RNN-Based Language Models

Take-Aways

  • Simple RNNs are difficult to train: exploding and vanishing gradients.
  • LSTM and GRU cells solve this:
    ○ Hidden states pass from one time step to the next, allowing for long-distance dependencies.
    ○ Gates are used to keep hidden states from changing rapidly (and thus keep gradients under control).
    ○ LSTM and GRU cells are complex, but simply a series of functions:
      ■ logistic(w·x)
      ■ tanh(w·x)
      ■ element-wise multiplication and addition
    ○ To train: mini-batch stochastic gradient descent over a cross-entropy cost.
