
Neural Machine Translation

Marcello Federico, 2016. Based on slides kindly provided by Thang Luong, Stanford U.

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics (Sutskever et al., 2014)
  • Attention mechanism (Bahdanau et al., 2015)

Recurrent Neural Networks (RNNs)

(Picture adapted from Andrej Karpathy)


RNN – Input Layer

(Picture adapted from Andrej Karpathy)

RNN – Hidden Layer

(Picture adapted from Andrej Karpathy)

(Diagram: the hidden state h_t is computed from the previous state h_{t-1} and the current input x_t.)


RNNs to represent sequences!
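To make the recurrence concrete, here is a minimal numpy sketch of a vanilla RNN step, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the dimensions and parameter names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: combine the current input with the previous
    hidden state and squash through tanh."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions (illustrative only).
input_dim, hidden_dim, seq_len = 4, 8, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                     # initial state, often set to 0
for x_t in rng.normal(size=(seq_len, input_dim)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # h now summarizes the prefix read so far
```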

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics (Sutskever et al., 2014)

    – Encoder-Decoder.
    – Training vs. Testing.
    – Backpropagation.
    – More about RNNs.

  • Attention mechanism (Bahdanau et al., 2015)

Neural Machine Translation (NMT)

  • Model P(target | source) directly.

(Figure: the model reads the source "I am a student _" and produces the target "Je suis étudiant _".)


Neural Machine Translation (NMT)

  • RNNs trained end-to-end (Sutskever et al., 2014).
  • Encoder-decoder approach.

(Figure: the encoder reads "I am a student _"; the decoder generates "Je suis étudiant _".)

Word Embeddings


  • Randomly initialized, one for each language.

– Learnable parameters.

(Figure: source embeddings feed the encoder; target embeddings feed the decoder.)
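A small sketch of what "randomly initialized, learnable embeddings, one per language" looks like in code; the vocabularies and sizes below are assumptions made up for the running example.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 8
src_vocab = {"I": 0, "am": 1, "a": 2, "student": 3, "_": 4}
tgt_vocab = {"Je": 0, "suis": 1, "étudiant": 2, "_": 3}

# One embedding table per language; both are learnable parameters.
src_embeddings = rng.normal(scale=0.1, size=(len(src_vocab), emb_dim))
tgt_embeddings = rng.normal(scale=0.1, size=(len(tgt_vocab), emb_dim))

# Looking up a word simply selects a row of the table; the rows are
# updated by backpropagation like any other parameter.
x_student = src_embeddings[src_vocab["student"]]
y_je = tgt_embeddings[tgt_vocab["Je"]]
```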

Recurrent Connections

  • Initial states: often set to 0.
  • Recurrent connections differ across layers and between encoder and decoder (encoder 1st layer, encoder 2nd layer, decoder 1st layer, decoder 2nd layer).

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics (Sutskever et al., 2014)

    – Encoder-Decoder.
    – Training vs. Testing.
    – Backpropagation.
    – More about RNNs.

  • Attention mechanism (Bahdanau et al., 2015)

Training vs. Testing

  • Training
    – Correct translations are available.
  • Testing
    – Only the source sentences are given.

(Figure: at training time both the source "I am a student _" and the target "Je suis étudiant _" are fed to the network; at test time only the source is given.)

Training – Softmax

  • Hidden states → scores.
  • Scores → probabilities (softmax function; see the sketch below).

(Figure: softmax parameters map each decoder hidden state to a score vector of size |V|; the softmax turns the scores into probabilities such as P(suis | Je, source).)
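A minimal sketch of the scores → probabilities step, assuming a plain softmax over a toy vocabulary; the score values are made-up numbers.

```python
import numpy as np

def softmax(scores):
    """Turn a score vector of size |V| into a probability distribution."""
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

# Toy vocabulary and scores (illustrative only); in the model, the scores
# would come from the decoder hidden state via the softmax parameters.
vocab = ["Je", "suis", "étudiant", "_"]
scores = np.array([1.0, 3.0, 0.5, -1.0])
probs = softmax(scores)
p_suis = probs[vocab.index("suis")]     # P(suis | Je, source) in the running example
```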

Training Loss

  • Maximize P(target | source):
    – Decompose into individual word predictions.
  • Per-word terms: log P(Je), log P(suis), log P(étudiant), log P(_).
  • The training loss is the sum of all individual losses.
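For completeness, the decomposition the slides allude to can be written out; this is the standard sequence-to-sequence objective, with the notation assumed rather than copied from the deck.

```latex
% Per-word decomposition of the training objective:
\log P(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{x}),
\qquad
\mathcal{L} \;=\; -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{x})
```

For the running example, the loss is −log P(Je) − log P(suis | Je) − log P(étudiant | Je suis) − log P(_ | Je suis étudiant), all conditioned on the source sentence.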

Testing

  • Feed in the most likely word as the next input.


NMT beam-search decoders are much simpler!
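As a sketch of "feed in the most likely word", here is what greedy decoding looks like; decoder_step, embed, and eos_id are hypothetical stand-ins, not functions from any particular toolkit.

```python
def greedy_decode(encoder_state, decoder_step, embed, eos_id, max_len=50):
    """Greedy decoding: at each step, feed back the most likely word."""
    state = encoder_state
    prev_word = eos_id            # in this deck's convention, "_" marks both start and end
    output = []
    for _ in range(max_len):
        state, probs = decoder_step(state, embed(prev_word))  # one decoder RNN step
        prev_word = int(probs.argmax())                       # pick the most likely word
        output.append(prev_word)
        if prev_word == eos_id:                               # stop at the sentence marker
            break
    return output
```

Beam search keeps the k best partial translations at each step instead of just one; the slide's point is that even this is far simpler than a phrase-based decoder.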

(Figure: a possible beam-search decoder.)

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics (Sutskever et al., 2014)

    – Encoder-Decoder.
    – Training vs. Testing.
    – Backpropagation.
    – More about RNNs.

  • Attention mechanism (Bahdanau et al., 2015)

Backpropagation Through Time

  • The gradients of the word losses (log P(_), log P(étudiant), log P(suis), …) flow backward through the unrolled network, starting from the last time step.
  • Accumulated gradients are initialized to 0.
  • RNN gradients are accumulated across time steps (see the sketch below).
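A numpy sketch of what "RNN gradients are accumulated" means for the shared recurrent matrix of the vanilla RNN step shown earlier; the per-step loss gradients dhs are assumed to be given, and only the gradient for W_hh is derived, for brevity.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll a vanilla RNN, keeping the states needed for backprop."""
    hs = [h0]
    for x_t in xs:
        hs.append(np.tanh(x_t @ W_xh + hs[-1] @ W_hh + b_h))
    return hs

def rnn_backward(xs, hs, dhs, W_hh):
    """Backpropagation through time: the gradient w.r.t. the shared
    recurrent matrix W_hh is accumulated over all time steps."""
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros_like(hs[0])           # gradient flowing in from the future, init to 0
    for t in reversed(range(len(xs))):
        dh = dhs[t] + dh_next                # loss gradient at step t + future gradient
        dpre = dh * (1.0 - hs[t + 1] ** 2)   # back through tanh
        dW_hh += np.outer(hs[t], dpre)       # accumulate, do not overwrite
        dh_next = dpre @ W_hh.T              # pass the gradient one step back in time
    return dW_hh

# Tiny usage with made-up data.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)
xs = rng.normal(size=(4, 3))
hs = rnn_forward(xs, np.zeros(5), W_xh, W_hh, b_h)
dhs = [rng.normal(size=5) for _ in range(4)]   # pretend per-step loss gradients
dW_hh = rnn_backward(xs, hs, dhs, W_hh)
```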

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics (Sutskever et al., 2014)

    – Encoder-Decoder.
    – Training vs. Testing.
    – Backpropagation.
    – More about RNNs.

  • Attention mechanism (Bahdanau et al., 2015)

Recurrent types – vanilla RNN

RNN

Vanishing gradient problem!

Vanishing gradients

(Slide: the chain rule expands the gradient through time into a product of per-step Jacobians; bounding each factor via the largest singular value of the recurrent matrix gives a sufficient condition for vanishing gradients; Pascanu et al., 2013.)
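For reference, the bound the slide's labels allude to can be written out roughly as follows, following Pascanu et al. (2013); the notation is assumed, not copied from the deck.

```latex
% Chain rule: the gradient through time is a product of per-step Jacobians.
\frac{\partial h_t}{\partial h_k}
  \;=\; \prod_{k < i \le t} \frac{\partial h_i}{\partial h_{i-1}}
  \;=\; \prod_{k < i \le t} W_{hh}^{\top}\,\operatorname{diag}\!\big(\sigma'(z_{i-1})\big)

% Bound: each factor is bounded by the product of the two norms.
\left\lVert \frac{\partial h_i}{\partial h_{i-1}} \right\rVert
  \;\le\; \lVert W_{hh}^{\top} \rVert \,\big\lVert \operatorname{diag}\!\big(\sigma'(z_{i-1})\big) \big\rVert
  \;\le\; \gamma_W \, \gamma_\sigma

% Sufficient condition: if the largest singular value of W_hh satisfies
% \lambda_1 < 1/\gamma_\sigma, the product shrinks exponentially in t - k
% and the gradient vanishes.
```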

Recurrent types – LSTM

  • Long Short-Term Memory (LSTM)

    – (Hochreiter & Schmidhuber, 1997)

  • LSTM cells are additively updated

    – Makes backprop through time easier.

C’mon, it’s been around for 20 years!

LSTM

LSTM cells

Building LSTM

  • A naïve version.

LSTM

Nice gradients!

slide-14
SLIDE 14

Building LSTM

  • Add input gates: control input signal.

Input gates

LSTM

Building LSTM

  • Add forget gates: control memory.

Forget gates

LSTM

Building LSTM

Output gates

  • Add output gates: extract information.
  • (Zaremba et al., 2014).

LSTM

Why does the LSTM work?

  • The additive operation is the key!
  • The backpropagation path through the cell is effective.

(Diagram: LSTM_{t-1} → + → LSTM_t — the cell state is carried forward through an additive connection.)
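A compact numpy sketch of the LSTM step the preceding slides build up, with input, forget, and output gates around an additively updated cell; the parameter shapes, names, and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters for the
    input (i), forget (f), output (o) gates and the candidate cell (g)."""
    z = x_t @ W + h_prev @ U + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g        # additive cell update: the key to easy backprop
    h_t = o * np.tanh(c_t)          # the output gate extracts information from the cell
    return h_t, c_t

# Toy dimensions (illustrative only).
d_in, d_hid = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d_in, 4 * d_hid))
U = rng.normal(scale=0.1, size=(d_hid, 4 * d_hid))
b = np.zeros(4 * d_hid)
h = c = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```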


Forget gates are important!

Other RNN units

  • (Graves, 2013): revived the LSTM.
    – Direct connections between cells and gates.
  • Gated Recurrent Unit (GRU) (Cho et al., 2014a):
    – No cells, same additive idea.
  • LSTM vs. GRU: mixed results (Chung et al., 2015).

Summary

Deep RNNs (Sutskever et al., 2014)

(Figure: deep multi-layer encoder-decoder RNNs.)

Bidirectional RNNs (Bahdanau et al., 2015)

  • Generalize well.
  • Small memory.
  • Simple decoder.

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics
  • Attention mechanism
  • Rare words

Sentence Length Problem

(Plot: translation quality by sentence length, with attention vs. without attention; Bahdanau et al., 2015.)

Why?

  • A fixed-dimensional source vector.
  • Problem: Markovian process.


Attention Mechanism

  • Solution: random access memory.
    – Retrieve as needed.
    – cf. Neural Turing Machine (Graves et al., 2014).

(Figure: the decoder attends over a pool of source states.)


Alignments as a by-product

  • A recent innovation in deep learning:

    – Control problems (Mnih et al., 2014)
    – Speech recognition (Chorowski et al., 2015)
    – Image caption generation (Xu et al., 2015)

(Bahdanau et al., 2015)

Simplified Attention (Bahdanau et al., 2015) + Deep LSTM (Sutskever et al., 2014)

(Figure: the deep LSTM encoder-decoder augmented with an attention layer that builds a context vector from the source states; having read "I am a student _" and produced "Je suis", what's next?)

Attention Mechanism – Scoring

  • Compare the target hidden state with each source hidden state.

(Figure: the attention layer scores the current decoder state against every source state; the slide uses illustrative scores such as 3, 5, 1, 1.)

Attention Mechanism – Normalization

  • Convert the scores into alignment weights.

(Figure: a softmax over the scores gives weights such as 0.1, 0.3, 0.5, 0.1.)

Attention Mechanism – Context vector

  • Build the context vector: a weighted average of the source hidden states.

Attention Mechanism – Hidden state

  • Compute the next hidden state.

Attention Mechanism – Predict

  • Predict the next word.
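A short numpy sketch of the score → normalize → context-vector pipeline just described, using a dot-product score (one of the options in Luong et al., 2015b); the dimensions and values are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(source_states, target_state):
    """Score each source state, normalize, and build the context vector."""
    scores = source_states @ target_state   # compare the target state with each source state
    weights = softmax(scores)               # alignment weights, sum to 1
    context = weights @ source_states       # weighted average of the source states
    return context, weights

# Toy example: 4 source states of dimension 3 (illustrative numbers only).
rng = np.random.default_rng(0)
source_states = rng.normal(size=(4, 3))
target_state = rng.normal(size=3)
context, weights = attention_step(source_states, target_state)
print(weights)   # four alignment weights that sum to 1
```

The context vector and the current decoder state are then combined to form the next hidden state, from which the next word is predicted.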


Attention Mechanism – Score Functions

(Slide: score functions from Bahdanau et al., 2015 and Luong et al., 2015b.)

  • More focused attention (Luong et al., 2015b):
    – Focus on a subset of the source words at each time step.
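The score functions the slide refers to are, as given in Luong et al. (2015b), roughly the following; the notation is paraphrased here, not copied from the deck.

```latex
% h_t: current target hidden state, \bar h_s: source hidden state,
% W_a, v_a: learnable attention parameters.
\mathrm{score}(h_t, \bar h_s) =
\begin{cases}
  h_t^{\top} \bar h_s & \text{dot} \\
  h_t^{\top} W_a \bar h_s & \text{general} \\
  v_a^{\top} \tanh\!\big(W_a [h_t; \bar h_s]\big) & \text{concat (additive; Bahdanau et al., 2015)}
\end{cases}
```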

Empirical Evidence

(Plot: BLEU as a function of sentence length, with and without attention; Luong et al., 2015b.)

  • ours, no attn (BLEU 13.9)
  • ours, local-p attn (BLEU 20.9)
  • ours, best system (BLEU 23.0)
  • WMT'14 best (BLEU 20.7)
  • Jean et al., 2015 (BLEU 21.6)

Outline

  • Recurrent Neural Networks (RNNs)
  • NMT basics
  • Attention mechanism
  • Rare words

Rare Word Problem

Parallel corpus:  The ecotax portico in Pont-de-Buis ↔ Le portique écotaxe de Pont-de-Buis
Actual input:     The <unk> portico in <unk> ↔ Le <unk> <unk> de <unk>

  • NMT vocabularies are modest in size, e.g., 50K words:

    – Simple softmax.
    – GPU friendliness.

Empirical Evidence

(Plot: BLEU on sentences ordered by average word frequency rank — phrase-based MT, Durrani et al. (37.0), vs. neural MT, Sutskever et al. (34.8).)

Pre-/Post-Processing Approach

  • Treat any NMT as a black box.
  • Simple:

    – Annotate training data.
    – Post-process translations.

  • State-of-the-art WMT English-French systems.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the Rare Word Problem in Neural Machine Translation. ACL 2015.

  • First, learn unsupervised alignments.

Parallel corpus:  The ecotax portico in Pont-de-Buis ↔ Le portique écotaxe de Pont-de-Buis
Actual input:     The <unk> portico in <unk> ↔ Le <unk> <unk> de <unk>

Pre-/Post-Processing Approach

  • Add positional information to target unkp tokens:
    – Relative distances.

Parallel corpus:  The ecotax portico in Pont-de-Buis ↔ Le portique écotaxe de Pont-de-Buis
Actual input:     The <unk> portico in <unk> ↔ Le unk1 unk-1 de unk0

"Attention" for rare words


Pre-/Post-Processing Approach

Test sentence:  The ecotax portico in Pont-de-Buis (The <unk> portico in <unk>)
Translation:    Le unk1 unk-1 de unk0

  • Word translation:
    – "Dictionary" extracted from the alignments.
  • Identity translation.
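A hedged Python sketch of this post-processing step: each unk token carries a relative offset to its aligned source word, which is looked up in a dictionary extracted from the alignments, or copied as-is (identity translation). The helper name and the dictionary entries are assumptions for the running example.

```python
import re

def replace_unks(source_words, target_words, dictionary):
    """Replace each unk<d> in the translation using the aligned source word."""
    output = []
    for j, w in enumerate(target_words):
        m = re.fullmatch(r"unk(-?\d+)", w)
        if m is None:
            output.append(w)                      # ordinary word: keep it
            continue
        d = int(m.group(1))
        src = source_words[j + d]                 # aligned source word at relative offset d
        output.append(dictionary.get(src, src))   # dictionary translation, else copy (identity)
    return output

# Running example from the slide (the dictionary entries are hypothetical).
source = "The ecotax portico in Pont-de-Buis".split()
target = "Le unk1 unk-1 de unk0".split()
print(replace_unks(source, target, {"portico": "portique", "ecotax": "écotaxe"}))
# -> ['Le', 'portique', 'écotaxe', 'de', 'Pont-de-Buis']
```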

Sample English-German translations

  • Translate names correctly.

  src   Orlando Bloom and Miranda Kerr still love each other
  ref   Orlando Bloom und Miranda Kerr lieben sich noch immer
  best  Orlando Bloom und Miranda Kerr lieben einander noch immer .
  base  Orlando Bloom und Lucas Miranda lieben einander noch immer .

(Luong et al., 2015b)


Summary

Simplified Attention (Bahdanau et al., 2015) + Deep LSTM (Sutskever et al., 2014)

(Figure: the encoder-decoder example "I am a student _" → "Je suis étudiant _".)

Encoder-decoder summary:

  • Deep LSTM encoder / Deep LSTM decoder: (Sutskever et al., 2014), (Luong et al., 2015a), (Luong et al., 2015b)
  • (Bidirectional) GRU encoder / GRU decoder: (Cho et al., 2014a), (Bahdanau et al., 2015), (Jean et al., 2015)
  • CNN encoder / (Inverse CNN) RNN decoder: (Kalchbrenner & Blunsom, 2013)
  • Gated Recursive CNN encoder / GRU decoder: (Cho et al., 2014b)

References (1)

  • [Bahdanau et al., 2015] Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/pdf/1409.0473.pdf
  • [Cho et al., 2014a] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. http://aclweb.org/anthology/D/D14/D14-1179.pdf
  • [Cho et al., 2014b] On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. http://www.aclweb.org/anthology/W14-4012
  • [Chorowski et al., 2015] Attention-Based Models for Speech Recognition. http://arxiv.org/pdf/1506.07503v1.pdf
  • [Chung et al., 2015] Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. http://arxiv.org/pdf/1412.3555.pdf
  • [Graves, 2013] Generating Sequences With Recurrent Neural Networks. http://arxiv.org/pdf/1308.0850v5.pdf
  • [Graves, 2014] Neural Turing Machines. http://arxiv.org/pdf/1410.5401v2.pdf
  • [Hochreiter & Schmidhuber, 1997] Long Short-term Memory. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf

References (2)

  • [Kalchbrenner & Blunsom, 2013] Recurrent Continuous Translation Models. http://nal.co/papers/KalchbrennerBlunsom_EMNLP13
  • [Luong et al., 2015a] Addressing the Rare Word Problem in Neural Machine Translation. http://www.aclweb.org/anthology/P15-1002
  • [Luong et al., 2015b] Effective Approaches to Attention-based Neural Machine Translation. https://aclweb.org/anthology/D/D15/D15-1166.pdf
  • [Mnih et al., 2014] Recurrent Models of Visual Attention. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
  • [Pascanu et al., 2013] On the difficulty of training Recurrent Neural Networks. http://arxiv.org/pdf/1211.5063v2.pdf
  • [Xu et al., 2015] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. http://jmlr.org/proceedings/papers/v37/xuc15.pdf
  • [Sutskever et al., 2014] Sequence to Sequence Learning with Neural Networks. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
  • [Zaremba et al., 2015] Recurrent Neural Network Regularization. http://arxiv.org/pdf/1409.2329.pdf