NLP Programming Tutorial 8 – Recurrent Neural Nets


Graham Neubig, Nara Institute of Science and Technology (NAIST)


Feed Forward Neural Nets

  • All connections point forward

[Figure: a feed-forward network mapping the input ϕ(x) to the output y]

  • It is a directed acyclic graph (DAG)

Recurrent Neural Nets (RNN)

  • Some of the node outputs are fed back in as inputs at the next step

[Figure: a recurrent network mapping the input ϕt(x) and the previous hidden state ht−1 to the output y]

  • Why? It makes it possible to “memorize” information about the previous steps

RNN in Sequence Modeling

[Figure: an RNN unrolled over a sequence; the same NET is applied at each time step, reading inputs x1 x2 x3 x4 and producing outputs y1 y2 y3 y4]


Example: POS Tagging

[Figure: the unrolled RNN tagging the input “natural language processing is” with the POS tags JJ NN NN VBZ]


Multi-class Prediction with Neural Networks


Review: Prediction Problems (given x, predict y)

  • Binary Prediction (2 choices)
    x: a book review (“Oh, man I love this book!” / “This book is so boring...”)
    y: is it positive? (yes / no)

  • Multi-class Prediction (several choices)
    x: a tweet (“On the way to the park!” / “公園に行くなう!”)
    y: its language (English / Japanese)

  • Structured Prediction (millions of choices)
    x: a sentence (“I read a book”)
    y: its syntactic parse

[Figure: parse tree of “I read a book” with nodes DET, NN, NP, VBD, VP, S, N]


Review: Sigmoid Function

  • The sigmoid softens the step function

[Figure: the step function and the sigmoid function, plotting p(y|x) against w⋅ϕ(x)]

P(y=1|x) = e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))
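A minimal NumPy sketch of this formula (the sigmoid name and the example call are mine, not part of the tutorial code):

import numpy as np

def sigmoid(score):
    # P(y=1|x) for score = w . phi(x)
    return np.exp(score) / (1 + np.exp(score))

# e.g. sigmoid(np.dot(w, phi_x))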


softmax Function

  • Sigmoid function for multiple classes
  • Can be expressed using matrix/vector ops

P(y|x) = e^(w⋅ϕ(x,y)) / Σỹ e^(w⋅ϕ(x,ỹ))

(The numerator scores the current class y; the denominator sums the scores over all classes ỹ.)

In matrix/vector form:  r = exp(W⋅ϕ(x)),  p = r / Σr̃∈r r̃  (divide by the sum of the elements of r)
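A rough sketch of the vectorized form in NumPy (the softmax helper name is mine; it is reused in the RNN sketches below):

import numpy as np

def softmax(score):
    # score = W . phi(x); exponentiate and normalize into a distribution p
    r = np.exp(score)
    return r / np.sum(r)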


Selecting the Best Value from a Probability Distribution

  • Find the index y with the highest probability

def find_best(p):
    # Return the index y of the element with the highest probability
    y = 0
    for i in range(1, len(p)):
        if p[i] > p[y]:
            y = i
    return y
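With NumPy arrays, the same selection can be written in one line:

y = np.argmax(p)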


softmax Function Gradient

  • The gradient is the difference between the true and estimated probability distributions
  • The true distribution p' is expressed with a vector with only the y-th element set to 1 (a one-hot vector)

−d err / d ϕout = p' − p        p' = {0, 0, …, 1, …, 0}
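A tiny numeric illustration of this error signal (the values are made up):

import numpy as np

p_true = np.array([0.0, 1.0, 0.0])   # one-hot vector p' for the correct class y = 1
p_est  = np.array([0.2, 0.5, 0.3])   # estimated distribution p
delta_out = p_true - p_est           # error passed back: [-0.2, 0.5, -0.3]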


Creating a 1-hot Vector

def create_one_hot(id, size):
    # A vector of zeros with a single 1 at position id
    vec = np.zeros(size)
    vec[id] = 1
    return vec


Forward Propagation in Recurrent Nets


Review: Forward Propagation Code

def forward_nn(network, phi0):
    phi = [phi0]    # Output of each layer
    for i in range(1, len(network) + 1):
        w, b = network[i - 1]
        # Calculate the value based on the previous layer
        phi.append(np.tanh(np.dot(w, phi[i - 1]) + b))
    return phi      # Return the values of all layers
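Here network is a list of (weight matrix, bias vector) pairs, one per layer; a tiny hypothetical example of calling the function (the sizes and random initialization are my own choices):

import numpy as np

# Hypothetical 2-layer net: 3 inputs -> 4 hidden units -> 1 output
network = [
    (np.random.rand(4, 3) - 0.5, np.zeros(4)),
    (np.random.rand(1, 4) - 0.5, np.zeros(1)),
]
phi = forward_nn(network, np.array([1.0, 0.0, 1.0]))
print(phi[-1])   # output of the final layer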


RNN Calculation

[Figure: two unrolled time steps of the RNN. At each step t, the previous hidden state ht−1 and the input xt feed a tanh unit (weights wr,h and wr,x, bias br) to produce ht, and ht feeds a softmax unit (weights wo,h, bias bo) to produce the output distribution pt]

ht = tanh(wr,h⋅ht−1 + wr,x⋅xt + br)
pt = softmax(wo,h⋅ht + bo)
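A minimal sketch of a single time step of these two equations, reusing the softmax helper sketched earlier (the shape convention is my assumption: wr,h is hidden x hidden, wr,x is hidden x input, wo,h is output x hidden, matching np.dot(w, input)):

import numpy as np

def rnn_step(w_rx, w_rh, b_r, w_oh, b_o, x_t, h_prev):
    # ht = tanh(wr,h . ht-1 + wr,x . xt + br)
    h_t = np.tanh(np.dot(w_rh, h_prev) + np.dot(w_rx, x_t) + b_r)
    # pt = softmax(wo,h . ht + bo)
    p_t = softmax(np.dot(w_oh, h_t) + b_o)
    return h_t, p_t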


RNN Forward Calculation

def forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x):
    h = []    # Hidden states (at each time t)
    p = []    # Output probability distributions (at each time t)
    y = []    # Predicted output values (at each time t)
    for t in range(len(x)):
        if t > 0:
            h.append(np.tanh(np.dot(w_rx, x[t]) + np.dot(w_rh, h[t - 1]) + b_r))
        else:
            h.append(np.tanh(np.dot(w_rx, x[t]) + b_r))
        p.append(softmax(np.dot(w_oh, h[t]) + b_o))   # softmax output, as in the equations on the previous slide
        y.append(find_best(p[t]))
    return h, p, y
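A small usage sketch (the sizes and random initialization are illustrative assumptions, not values from the tutorial):

import numpy as np

vocab, classes, hidden = 5, 3, 4
w_rx = np.random.rand(hidden, vocab) - 0.5
w_rh = np.random.rand(hidden, hidden) - 0.5
b_r  = np.zeros(hidden)
w_oh = np.random.rand(classes, hidden) - 0.5
b_o  = np.zeros(classes)

# A sequence of two one-hot input vectors
x = [create_one_hot(2, vocab), create_one_hot(4, vocab)]
h, p, y = forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x)
print(y)   # predicted tag IDs, one per input position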


Review: Back Propagation in Feed-forward Nets


Stochastic Gradient Descent

  • Online training algorithm for probabilistic models (including logistic regression)

w = 0
for I iterations:
    for each labeled pair x, y in the data:
        w += α * dP(y|x)/dw

  • In other words:
  • For every training example, calculate the gradient (the direction that will increase the probability of y)
  • Move in that direction, multiplied by the learning rate α

Gradient of the Sigmoid Function

  • Take the derivative of the probability

[Figure: the gradient dp(y|x)/d(w⋅ϕ(x)) plotted against w⋅ϕ(x), peaking around 0]

d P(y=1|x) / d w = d/dw [ e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x))) ]
                 = ϕ(x) e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))^2

d P(y=−1|x) / d w = d/dw [ 1 − e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x))) ]
                  = −ϕ(x) e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))^2
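Combining the SGD update from the previous slide with these derivatives, one online update for the binary sigmoid model might look like this sketch (the function name and default α are mine):

import numpy as np

def sgd_update(w, phi_x, y, alpha=0.1):
    # dP(y|x)/dw = +/- phi(x) e^(w.phi(x)) / (1 + e^(w.phi(x)))^2
    score = np.dot(w, phi_x)
    grad = phi_x * np.exp(score) / (1 + np.exp(score)) ** 2
    if y == -1:
        grad = -grad
    return w + alpha * grad   # the SGD step: w += alpha * dP(y|x)/dw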


Learning: Don't Know Derivative for Hidden Units!

  • For NNs, we only know the correct tag for the last layer

[Figure: a network where the input ϕ(x) feeds hidden units with weights w1, w2, w3 to produce h(x), and h(x) feeds the output unit with weights w4 to predict y=1]

d P(y=1|x) / d w4 = h(x) e^(w4⋅h(x)) / (1 + e^(w4⋅h(x)))^2

d P(y=1|x) / d w1 = ?    d P(y=1|x) / d w2 = ?    d P(y=1|x) / d w3 = ?


Answer: Back-Propagation

  • Calculate the derivative with the chain rule

d P(y=1|x) / d w1 = [d P(y=1|x) / d (w4⋅h(x))] ⋅ [d (w4⋅h(x)) / d h1(x)] ⋅ [d h1(x) / d w1]

Here the first factor is the error of the next unit (δ4), equal to e^(w4⋅h(x)) / (1 + e^(w4⋅h(x)))^2, the second factor is the weight w1,4, and the third is the gradient of this unit.

  • In general, calculate δi based on the next units j:

d P(y=1|x) / d wi = (d hi(x) / d wi) Σj δj wi,j
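A rough sketch of this rule for a layer of tanh hidden units (the names and the assumption that w_next holds one row of weights per next unit j are mine; the tanh gradient 1 − h^2 is the same one used in the RNN gradient code later):

import numpy as np

def hidden_delta(delta_next, w_next, h):
    # Sum the errors of the next units j weighted by w_{i,j},
    # then multiply by the gradient of this unit's tanh output: 1 - h^2
    return np.dot(delta_next, w_next) * (1 - h ** 2)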

Conceptual Picture

  • Send errors back through the net

[Figure: the network from before, with the errors δ4, δ3, δ2, δ1 propagated backward from the output y toward the input ϕ(x) through the weights w1 .. w4]


Back Propagation in Recurrent Nets


What Errors do we Know?

[Figure: the unrolled RNN with inputs x1 .. x4 and outputs y1 .. y4; each output has an error δo,1 .. δo,4, and the recurrent connections between time steps carry errors δr,1 .. δr,3]

  • We know the output errors δo
  • Must use back-prop to find recurrent errors δr

How to Back-Propagate?

  • Standard back propagation through time (BPTT):
      • For each δo, calculate n steps of δr
  • Full gradient calculation:
      • Use dynamic programming to calculate the whole sequence


Back Propagation through Time

[Figure: the unrolled RNN; each output error δo,1 .. δo,4 is propagated back as δ through only the previous n steps]

  • Use only one output error at a time
  • Stop after n steps (here, n = 2)


Full Gradient Calculation

[Figure: the unrolled RNN; all output errors δo,1 .. δo,4 are propagated backward together through the whole sequence]

  • First, calculate the whole net result forward
  • Then, calculate the result backwards


BPTT? Full Gradient?

  • Full gradient:
  • + Faster, no time limit
  • - Must save the result of the whole sequence in memory
  • BPTT:
  • + Only remember the results in the past few steps
  • - Slower, less accurate for long dependencies

Vanishing Gradient in Neural Nets

[Figure: propagating δo,4 back through the unrolled RNN; the error δ shrinks at every step, from medium to small to tiny to very tiny (the vanishing gradient)]

  • “Long Short Term Memory” is designed to solve this

RNN Full Gradient Calculation

def gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y_prime):
    # Gradients of each parameter; outer products follow the np.dot(w, input) shape convention of forward_rnn
    dw_rx = np.zeros(w_rx.shape); dw_rh = np.zeros(w_rh.shape); db_r = np.zeros(b_r.shape)
    dw_oh = np.zeros(w_oh.shape); db_o = np.zeros(b_o.shape)
    delta_r_next = np.zeros(len(b_r))                   # Error from the following time step
    for t in range(len(x) - 1, -1, -1):
        p_prime = create_one_hot(y_prime[t], len(b_o))  # True distribution for the correct tag
        delta_o = p_prime - p[t]                        # Output error
        dw_oh += np.outer(delta_o, h[t]); db_o += delta_o              # Output gradient
        delta_r = np.dot(delta_r_next, w_rh) + np.dot(delta_o, w_oh)   # Backprop into the hidden layer
        delta_r_next = delta_r * (1 - h[t] ** 2)                       # tanh gradient
        dw_rx += np.outer(delta_r_next, x[t]); db_r += delta_r_next    # Hidden gradient
        if t != 0:
            dw_rh += np.outer(delta_r_next, h[t - 1])
    return dw_rx, dw_rh, db_r, dw_oh, db_o


Weight Update

def update_weights(w_rx, w_rh, b_r, w_oh, b_o, dw_rx, dw_rh, db_r, dw_oh, db_o, lam):
    # Move each parameter in the direction of its gradient, scaled by the learning rate λ
    w_rx += lam * dw_rx
    w_rh += lam * dw_rh
    b_r += lam * db_r
    w_oh += lam * dw_oh
    b_o += lam * db_o
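Putting the three functions together, one training step on a single labeled sequence might look like the following sketch (λ = 0.01 is an arbitrary illustrative value):

h, p, y = forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x)
dw_rx, dw_rh, db_r, dw_oh, db_o = gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y_prime)
update_weights(w_rx, w_rh, b_r, w_oh, b_o, dw_rx, dw_rh, db_r, dw_oh, db_o, 0.01)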


Overall Training Algorithm

# Create features
create map x_ids, y_ids, array data
for each labeled pair x, y in the data:
    add (create_ids(x, x_ids), create_ids(y, y_ids)) to data

# Perform training
initialize net randomly
for I iterations:
    for each labeled pair x, y' in data:
        h, p, y = forward_rnn(net, x)
        Δ = gradient_rnn(net, x, h, p, y')
        update_weights(net, Δ, λ)

print net to weight_file
print x_ids, y_ids to id_file
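create_ids is not defined in these slides; a plausible sketch of its behavior (my own assumption: map each token to an integer ID, adding new IDs as tokens are first seen) might be:

def create_ids(tokens, id_map):
    # Assign a new integer ID to each unseen token and return the ID sequence
    ids = []
    for token in tokens:
        if token not in id_map:
            id_map[token] = len(id_map)
        ids.append(id_map[token])
    return ids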



Exercise

  • Create an RNN for sequence labeling
  • Write a training program (train-rnn) and a testing program (test-rnn)
  • Test: same data as POS tagging
  • Input: test/05-{train,test}-input.txt
  • Reference: test/05-{train,test}-answer.txt
  • Train a model with data/wiki-en-train.norm_pos and predict for data/wiki-en-test.norm
  • Evaluate the POS performance, and compare with the HMM:

    script/gradepos.pl data/wiki-en-test.pos my_answer.pos