NLP Programming Tutorial 8 – Recurrent Neural Nets


Graham Neubig, Nara Institute of Science and Technology (NAIST)


Feed Forward Neural Nets

  • All connections point forward

[Figure: a feed-forward network mapping the input ϕ(x) to the output y]

  • It is a directed acyclic graph (DAG)

Recurrent Neural Nets (RNN)

  • Some of the node outputs are fed back in as inputs at the next step

[Figure: a recurrent network mapping the input ϕt(x) and the previous hidden state ht−1 to the output y]

  • Why? It makes it possible to “memorize” information about the previous steps

RNN in Sequence Modeling

[Figure: an RNN unrolled over a sequence; the same NET is applied at each time step, reading inputs x1 x2 x3 x4 and producing outputs y1 y2 y3 y4]


Example: POS Tagging

[Figure: the unrolled RNN tagging the input “natural language processing is” with the POS tags JJ NN NN VBZ]


Multi-class Prediction with Neural Networks


Review: Prediction Problems (given x, predict y)

  • Binary Prediction (2 choices)
    x: a book review (“Oh, man I love this book!” / “This book is so boring...”)
    y: is it positive? (yes / no)

  • Multi-class Prediction (several choices)
    x: a tweet (“On the way to the park!” / “公園に行くなう!”)
    y: its language (English / Japanese)

  • Structured Prediction (millions of choices)
    x: a sentence (“I read a book”)
    y: its syntactic parse

[Figure: parse tree of “I read a book” with nodes DET, NN, NP, VBD, VP, S, N]


Review: Sigmoid Function

  • The sigmoid softens the step function

[Figure: the step function and the sigmoid function, plotting p(y|x) against w⋅ϕ(x)]

P(y=1|x) = e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))
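A minimal NumPy sketch of this formula (the sigmoid name and the example call are mine, not part of the tutorial code):

import numpy as np

def sigmoid(score):
    # P(y=1|x) for score = w . phi(x)
    return np.exp(score) / (1 + np.exp(score))

# e.g. sigmoid(np.dot(w, phi_x))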


softmax Function

  • Sigmoid function for multiple classes
  • Can be expressed using matrix/vector ops

P(y|x) = e^(w⋅ϕ(x,y)) / Σỹ e^(w⋅ϕ(x,ỹ))

(The numerator scores the current class y; the denominator sums the scores over all classes ỹ.)

In matrix/vector form:  r = exp(W⋅ϕ(x)),  p = r / Σr̃∈r r̃  (divide by the sum of the elements of r)
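A rough sketch of the vectorized form in NumPy (the softmax helper name is mine; it is reused in the RNN sketches below):

import numpy as np

def softmax(score):
    # score = W . phi(x); exponentiate and normalize into a distribution p
    r = np.exp(score)
    return r / np.sum(r)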


Selecting the Best Value from a Probability Distribution

  • Find the index y with the highest probability

def find_best(p):
    # Return the index y of the element with the highest probability
    y = 0
    for i in range(1, len(p)):
        if p[i] > p[y]:
            y = i
    return y
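With NumPy arrays, the same selection can be written in one line:

y = np.argmax(p)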


softmax Function Gradient

  • The gradient is the difference between the true and estimated probability distributions
  • The true distribution p' is expressed with a vector with only the y-th element set to 1 (a one-hot vector)

−d err / d ϕout = p' − p        p' = {0, 0, …, 1, …, 0}
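A tiny numeric illustration of this error signal (the values are made up):

import numpy as np

p_true = np.array([0.0, 1.0, 0.0])   # one-hot vector p' for the correct class y = 1
p_est  = np.array([0.2, 0.5, 0.3])   # estimated distribution p
delta_out = p_true - p_est           # error passed back: [-0.2, 0.5, -0.3]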


Creating a 1-hot Vector

def create_one_hot(id, size):
    # A vector of zeros with a single 1 at position id
    vec = np.zeros(size)
    vec[id] = 1
    return vec


Forward Propagation in Recurrent Nets


Review: Forward Propagation Code

def forward_nn(network, phi0):
    phi = [phi0]    # Output of each layer
    for i in range(1, len(network) + 1):
        w, b = network[i - 1]
        # Calculate the value based on the previous layer
        phi.append(np.tanh(np.dot(w, phi[i - 1]) + b))
    return phi      # Return the values of all layers
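Here network is a list of (weight matrix, bias vector) pairs, one per layer; a tiny hypothetical example of calling the function (the sizes and random initialization are my own choices):

import numpy as np

# Hypothetical 2-layer net: 3 inputs -> 4 hidden units -> 1 output
network = [
    (np.random.rand(4, 3) - 0.5, np.zeros(4)),
    (np.random.rand(1, 4) - 0.5, np.zeros(1)),
]
phi = forward_nn(network, np.array([1.0, 0.0, 1.0]))
print(phi[-1])   # output of the final layer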


RNN Calculation

[Figure: two unrolled time steps of the RNN. At each step t, the previous hidden state ht−1 and the input xt feed a tanh unit (weights wr,h and wr,x, bias br) to produce ht, and ht feeds a softmax unit (weights wo,h, bias bo) to produce the output distribution pt]

ht = tanh(wr,h⋅ht−1 + wr,x⋅xt + br)
pt = softmax(wo,h⋅ht + bo)
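A minimal sketch of a single time step of these two equations, reusing the softmax helper sketched earlier (the shape convention is my assumption: wr,h is hidden x hidden, wr,x is hidden x input, wo,h is output x hidden, matching np.dot(w, input)):

import numpy as np

def rnn_step(w_rx, w_rh, b_r, w_oh, b_o, x_t, h_prev):
    # ht = tanh(wr,h . ht-1 + wr,x . xt + br)
    h_t = np.tanh(np.dot(w_rh, h_prev) + np.dot(w_rx, x_t) + b_r)
    # pt = softmax(wo,h . ht + bo)
    p_t = softmax(np.dot(w_oh, h_t) + b_o)
    return h_t, p_t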


RNN Forward Calculation

def forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x):
    h = []    # Hidden states (at each time t)
    p = []    # Output probability distributions (at each time t)
    y = []    # Predicted output values (at each time t)
    for t in range(len(x)):
        if t > 0:
            h.append(np.tanh(np.dot(w_rx, x[t]) + np.dot(w_rh, h[t - 1]) + b_r))
        else:
            h.append(np.tanh(np.dot(w_rx, x[t]) + b_r))
        p.append(softmax(np.dot(w_oh, h[t]) + b_o))   # softmax output, as in the equations on the previous slide
        y.append(find_best(p[t]))
    return h, p, y
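A small usage sketch (the sizes and random initialization are illustrative assumptions, not values from the tutorial):

import numpy as np

vocab, classes, hidden = 5, 3, 4
w_rx = np.random.rand(hidden, vocab) - 0.5
w_rh = np.random.rand(hidden, hidden) - 0.5
b_r  = np.zeros(hidden)
w_oh = np.random.rand(classes, hidden) - 0.5
b_o  = np.zeros(classes)

# A sequence of two one-hot input vectors
x = [create_one_hot(2, vocab), create_one_hot(4, vocab)]
h, p, y = forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x)
print(y)   # predicted tag IDs, one per input position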


Review: Back Propagation in Feed-forward Nets


Stochastic Gradient Descent

  • Online training algorithm for probabilistic models (including logistic regression)

w = 0
for I iterations:
    for each labeled pair x, y in the data:
        w += α * dP(y|x)/dw

  • In other words:
  • For every training example, calculate the gradient (the direction that will increase the probability of y)
  • Move in that direction, multiplied by the learning rate α

Gradient of the Sigmoid Function

  • Take the derivative of the probability

[Figure: the gradient dp(y|x)/d(w⋅ϕ(x)) plotted against w⋅ϕ(x), peaking around 0]

d P(y=1|x) / d w = d/dw [ e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x))) ]
                 = ϕ(x) e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))^2

d P(y=−1|x) / d w = d/dw [ 1 − e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x))) ]
                  = −ϕ(x) e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))^2
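Combining the SGD update from the previous slide with these derivatives, one online update for the binary sigmoid model might look like this sketch (the function name and default α are mine):

import numpy as np

def sgd_update(w, phi_x, y, alpha=0.1):
    # dP(y|x)/dw = +/- phi(x) e^(w.phi(x)) / (1 + e^(w.phi(x)))^2
    score = np.dot(w, phi_x)
    grad = phi_x * np.exp(score) / (1 + np.exp(score)) ** 2
    if y == -1:
        grad = -grad
    return w + alpha * grad   # the SGD step: w += alpha * dP(y|x)/dw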


Learning: Don't Know Derivative for Hidden Units!

  • For NNs, we only know the correct tag for the last layer

[Figure: a network where the input ϕ(x) feeds hidden units with weights w1, w2, w3 to produce h(x), and h(x) feeds the output unit with weights w4 to predict y=1]

d P(y=1|x) / d w4 = h(x) e^(w4⋅h(x)) / (1 + e^(w4⋅h(x)))^2

d P(y=1|x) / d w1 = ?    d P(y=1|x) / d w2 = ?    d P(y=1|x) / d w3 = ?


Answer: Back-Propagation

  • Calculate the derivative with the chain rule

d P(y=1|x) / d w1 = [d P(y=1|x) / d (w4⋅h(x))] ⋅ [d (w4⋅h(x)) / d h1(x)] ⋅ [d h1(x) / d w1]

Here the first factor is the error of the next unit (δ4), equal to e^(w4⋅h(x)) / (1 + e^(w4⋅h(x)))^2, the second factor is the weight w1,4, and the third is the gradient of this unit.

  • In general, calculate δi based on the next units j:

d P(y=1|x) / d wi = (d hi(x) / d wi) Σj δj wi,j
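A rough sketch of this rule for a layer of tanh hidden units (the names and the assumption that w_next holds one row of weights per next unit j are mine; the tanh gradient 1 − h^2 is the same one used in the RNN gradient code later):

import numpy as np

def hidden_delta(delta_next, w_next, h):
    # Sum the errors of the next units j weighted by w_{i,j},
    # then multiply by the gradient of this unit's tanh output: 1 - h^2
    return np.dot(delta_next, w_next) * (1 - h ** 2)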

Conceptual Picture

  • Send errors back through the net

[Figure: the network from before, with the errors δ4, δ3, δ2, δ1 propagated backward from the output y toward the input ϕ(x) through the weights w1 .. w4]


Back Propagation in Recurrent Nets


What Errors do we Know?

[Figure: the unrolled RNN with inputs x1 .. x4 and outputs y1 .. y4; each output has an error δo,1 .. δo,4, and the recurrent connections between time steps carry errors δr,1 .. δr,3]

  • We know the output errors δo
  • Must use back-prop to find recurrent errors δr

How to Back-Propagate?

  • Standard back propagation through time (BPTT):
      • For each δo, calculate n steps of δr
  • Full gradient calculation:
      • Use dynamic programming to calculate the whole sequence


Back Propagation through Time

[Figure: the unrolled RNN; each output error δo,1 .. δo,4 is propagated back as δ through only the previous n steps]

  • Use only one output error at a time
  • Stop after n steps (here, n = 2)


Full Gradient Calculation

[Figure: the unrolled RNN; all output errors δo,1 .. δo,4 are propagated backward together through the whole sequence]

  • First, calculate the whole net result forward
  • Then, calculate the result backwards


BPTT? Full Gradient?

  • Full gradient:
  • + Faster, no time limit
  • - Must save the result of the whole sequence in memory
  • BPTT:
  • + Only remember the results in the past few steps
  • - Slower, less accurate for long dependencies

Vanishing Gradient in Neural Nets

[Figure: propagating δo,4 back through the unrolled RNN; the error δ shrinks at every step, from medium to small to tiny to very tiny (the vanishing gradient)]

  • “Long Short Term Memory” is designed to solve this

RNN Full Gradient Calculation

def gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y_prime):
    # Gradients of each parameter; outer products follow the np.dot(w, input) shape convention of forward_rnn
    dw_rx = np.zeros(w_rx.shape); dw_rh = np.zeros(w_rh.shape); db_r = np.zeros(b_r.shape)
    dw_oh = np.zeros(w_oh.shape); db_o = np.zeros(b_o.shape)
    delta_r_next = np.zeros(len(b_r))                   # Error from the following time step
    for t in range(len(x) - 1, -1, -1):
        p_prime = create_one_hot(y_prime[t], len(b_o))  # True distribution for the correct tag
        delta_o = p_prime - p[t]                        # Output error
        dw_oh += np.outer(delta_o, h[t]); db_o += delta_o              # Output gradient
        delta_r = np.dot(delta_r_next, w_rh) + np.dot(delta_o, w_oh)   # Backprop into the hidden layer
        delta_r_next = delta_r * (1 - h[t] ** 2)                       # tanh gradient
        dw_rx += np.outer(delta_r_next, x[t]); db_r += delta_r_next    # Hidden gradient
        if t != 0:
            dw_rh += np.outer(delta_r_next, h[t - 1])
    return dw_rx, dw_rh, db_r, dw_oh, db_o


Weight Update

def update_weights(w_rx, w_rh, b_r, w_oh, b_o, dw_rx, dw_rh, db_r, dw_oh, db_o, lam):
    # Move each parameter in the direction of its gradient, scaled by the learning rate λ
    w_rx += lam * dw_rx
    w_rh += lam * dw_rh
    b_r += lam * db_r
    w_oh += lam * dw_oh
    b_o += lam * db_o
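Putting the three functions together, one training step on a single labeled sequence might look like the following sketch (λ = 0.01 is an arbitrary illustrative value):

h, p, y = forward_rnn(w_rx, w_rh, b_r, w_oh, b_o, x)
dw_rx, dw_rh, db_r, dw_oh, db_o = gradient_rnn(w_rx, w_rh, b_r, w_oh, b_o, x, h, p, y_prime)
update_weights(w_rx, w_rh, b_r, w_oh, b_o, dw_rx, dw_rh, db_r, dw_oh, db_o, 0.01)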


Overall Training Algorithm

# Create features
create map x_ids, y_ids, array data
for each labeled pair x, y in the data:
    add (create_ids(x, x_ids), create_ids(y, y_ids)) to data

# Perform training
initialize net randomly
for I iterations:
    for each labeled pair x, y' in data:
        h, p, y = forward_rnn(net, x)
        Δ = gradient_rnn(net, x, h, p, y')
        update_weights(net, Δ, λ)

print net to weight_file
print x_ids, y_ids to id_file
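create_ids is not defined in these slides; a plausible sketch of its behavior (my own assumption: map each token to an integer ID, adding new IDs as tokens are first seen) might be:

def create_ids(tokens, id_map):
    # Assign a new integer ID to each unseen token and return the ID sequence
    ids = []
    for token in tokens:
        if token not in id_map:
            id_map[token] = len(id_map)
        ids.append(id_map[token])
    return ids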



Exercise

  • Create an RNN for sequence labeling
  • Write a training program (train-rnn) and a testing program (test-rnn)
  • Test: same data as POS tagging
  • Input: test/05-{train,test}-input.txt
  • Reference: test/05-{train,test}-answer.txt
  • Train a model with data/wiki-en-train.norm_pos and predict for data/wiki-en-test.norm
  • Evaluate the POS performance, and compare with the HMM:

    script/gradepos.pl data/wiki-en-test.pos my_answer.pos