

SLIDE 1

Neural Machine Translation

Gongbo Tang

8 October 2018

SLIDE 2

Outline

1. Neural Machine Translation
2. Advances and Challenges

SLIDE 3

Neural Machine Translation

Figure – Recurrent neural network based NMT model

From Thang Luong's thesis on Neural Machine Translation

SLIDE 4

Neural Machine Translation

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 5

Modelling Translation

Suppose that we have:

a source sentence S of length m: (x_1, \dots, x_m)
a target sentence T of length n: (y_1, \dots, y_n)

We can express translation as a probabilistic model:

T^* = \arg\max_T p(T|S)

Expanding using the chain rule gives:

p(T|S) = p(y_1, \dots, y_n \mid x_1, \dots, x_m) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)

SLIDE 6

Modelling Translation

Target-side language model:

p(T) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1})

Translation model:

p(T|S) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)

We could just treat the sentence pair as one long sequence, but:

we do not care about p(S)
we may want a different vocabulary and network architecture for the source text
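As a quick illustration of the factorisation, here is a minimal Python sketch with assumed toy probabilities; real decoders sum log-probabilities for numerical stability rather than multiplying raw ones:

```python
import math

# Hypothetical per-token probabilities p(y_i | y_1..y_{i-1}, x_1..x_m)
# produced by a decoder softmax for a 4-token target sentence.
token_probs = [0.62, 0.48, 0.91, 0.99]

# Chain rule: p(T|S) is the product of the factors; summing logs is stabler.
log_p = sum(math.log(p) for p in token_probs)
print(f"p(T|S) = {math.exp(log_p):.4f}")  # 0.62 * 0.48 * 0.91 * 0.99
```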

SLIDE 7

Attentional Encoder-Decoder: Maths

Simplifications of the model by [Bahdanau et al., 2015] (for illustration):

plain RNN instead of GRU
simpler output layer
we do not show bias terms
decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]
details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation: W, U, E, C, V are weight matrices (of different dimensionality):

E: one-hot to embedding (e.g. 50000 × 512)
W: embedding to hidden (e.g. 512 × 1024)
U: hidden to hidden (e.g. 1024 × 1024)
C: context (2× hidden) to hidden (e.g. 2048 × 1024)
V_o: hidden to one-hot (e.g. 1024 × 50000)
separate weight matrices for encoder and decoder (e.g. E_x and E_y)
input X of length T_x; output Y of length T_y

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 8

Encoder

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 9

Encoder

Encoder with bidirectional Recurrent Neural Networks:

input word embeddings
left-to-right recurrent NN
right-to-left recurrent NN

SLIDE 10

Encoder

Encoder with bidirectional Recurrent Neural Networks:

input word embeddings
left-to-right recurrent NN
right-to-left recurrent NN

\overrightarrow{h}_j = \begin{cases} 0 & \text{if } j = 0 \\ \tanh(\overrightarrow{W}_x E_x x_j + \overrightarrow{U}_x \overrightarrow{h}_{j-1}) & \text{if } j > 0 \end{cases}

\overleftarrow{h}_j = \begin{cases} 0 & \text{if } j = T_x + 1 \\ \tanh(\overleftarrow{W}_x E_x x_j + \overleftarrow{U}_x \overleftarrow{h}_{j+1}) & \text{if } j \le T_x \end{cases}

h_j = (\overrightarrow{h}_j, \overleftarrow{h}_j)
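A minimal NumPy sketch of these encoder equations, assuming toy dimensions and random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb, hid, Tx = 100, 8, 16, 5               # illustrative sizes

E_x = rng.normal(0, 0.1, (emb, vocab))            # one-hot -> embedding
W_f, W_b = (rng.normal(0, 0.1, (hid, emb)) for _ in range(2))
U_f, U_b = (rng.normal(0, 0.1, (hid, hid)) for _ in range(2))

x = rng.integers(0, vocab, Tx)                    # source token ids

h_f = np.zeros((Tx + 1, hid))                     # h_f[0] covers the j = 0 case
for j in range(1, Tx + 1):
    h_f[j] = np.tanh(W_f @ E_x[:, x[j - 1]] + U_f @ h_f[j - 1])

h_b = np.zeros((Tx + 2, hid))                     # h_b[Tx+1] covers j = Tx + 1
for j in range(Tx, 0, -1):
    h_b[j] = np.tanh(W_b @ E_x[:, x[j - 1]] + U_b @ h_b[j + 1])

# h_j concatenates both directions, as in the last equation above.
h = np.concatenate([h_f[1:], h_b[1:Tx + 1]], axis=1)   # shape (Tx, 2*hid)
print(h.shape)
```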

SLIDE 11

Decoder

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 12

Decoder

Figure – Decoder network: context vectors c_{i-1}, c_i; states s_{i-1}, s_i; intermediate states t_{i-1}, t_i for word prediction; selected words y_{i-1}, y_i and their embeddings E y_{i-1}, E y_i

SLIDE 13

Decoder

Figure – Decoder network (as on the previous slide)

s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}

t_i = \tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)

y_i = \mathrm{softmax}(V_o t_i)
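The same equations as a NumPy sketch (toy shapes, random untrained weights; `c_i` would come from the attention mechanism introduced below):

```python
import numpy as np

rng = np.random.default_rng(1)
emb, hid, ctx, vocab = 8, 16, 32, 100     # illustrative sizes
W_y, U_y, C = (rng.normal(0, 0.1, s) for s in [(hid, emb), (hid, hid), (hid, ctx)])
U_o, W_o, C_o = (rng.normal(0, 0.1, s) for s in [(hid, hid), (hid, emb), (hid, ctx)])
V_o = rng.normal(0, 0.1, (vocab, hid))
E_y = rng.normal(0, 0.1, (emb, vocab))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c_i):
    """One Look, Update, Generate step: next state s_i and p(y_i | ...)."""
    s_i = np.tanh(W_y @ E_y[:, y_prev] + U_y @ s_prev + C @ c_i)
    t_i = np.tanh(U_o @ s_i + W_o @ E_y[:, y_prev] + C_o @ c_i)
    return s_i, softmax(V_o @ t_i)

s, p = decoder_step(y_prev=3, s_prev=np.zeros(hid), c_i=rng.normal(0, 1, ctx))
print(p.argmax(), p.max())   # most probable next word under random weights
```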


SLIDE 15

Decoder

Training: y_i is known; the training objective is to assign higher probability to the correct output word. One common cost function is the negative log of the probability given to the correct word translation:

cost = -\log t_i[y_i]

Inference: y_i is unknown; we compute the probability distribution over the whole vocabulary.

Greedy search: select the word with the highest probability.
Beam search: keep the top k most likely word choices.
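A compact sketch of beam search over a hypothetical `step` function that maps a token prefix to next-token log-probabilities; greedy search is the special case k = 1:

```python
import heapq
import numpy as np

def beam_search(step, bos, eos, k=3, max_len=20):
    """step(prefix) -> log-probability vector over the vocabulary."""
    beams = [(0.0, [bos])]                        # (cumulative log-prob, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            log_probs = step(prefix)
            for y in np.argsort(log_probs)[-k:]:  # top-k extensions per beam
                candidates.append((score + float(log_probs[y]), prefix + [int(y)]))
        beams = heapq.nlargest(k, candidates)     # prune to the k best prefixes
        finished += [b for b in beams if b[1][-1] == eos]
        beams = [b for b in beams if b[1][-1] != eos]
        if not beams:
            break
    return max(finished or beams)                 # best-scoring hypothesis
```

Longer hypotheses accumulate more negative log-probability, so practical systems usually length-normalise the scores before comparing finished hypotheses.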

SLIDE 16

Decoding

Figure – Greedy search over the target sentence "hello world ! <eos>": each step selects the single most probable word (0.946, 0.957, 0.928, 0.999), with the running negative log-probability of the hypothesis (0.056, 0.100, 0.175, 0.175)

SLIDE 17

Decoding

Figure – Beam search with K = 3: at each step the top 3 partial hypotheses are kept (e.g. hello / HI / Hey, then world / World / ...), each scored by cumulative negative log-probability; the best complete hypothesis is "hello world ! <eos>"

SLIDE 18

Attention

Figure – NMT model with attention mechanism

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 19

Attention

Figure – Attention network: the encoder hidden states h and the previous decoder state s_{i-1} feed the attention network, whose softmax produces weights α_i and a context vector c_i for the decoder's predictions

SLIDE 20

Attention

Figure – Attention network (as on the previous slide)

e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)

\alpha_{ij} = \mathrm{softmax}(e_{ij})

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
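These three equations in NumPy, with illustrative shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(2)
Tx, hid, ctx = 5, 16, 32                 # toy dimensions
W_a = rng.normal(0, 0.1, (hid, hid))
U_a = rng.normal(0, 0.1, (hid, ctx))
v_a = rng.normal(0, 0.1, hid)

s_prev = rng.normal(0, 1, hid)           # decoder state s_{i-1}
h = rng.normal(0, 1, (Tx, ctx))          # encoder annotations h_1..h_Tx

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over positions
c_i = alpha @ h                                     # weighted sum of annotations
print(alpha.round(3), c_i.shape)         # weights sum to 1; c_i has shape (ctx,)
```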

SLIDE 21

Overview of NMT

Pros:
more fluent translation
fewer lexical errors
fewer word order errors
fewer morphology errors

Cons:
expensive computation
over-translation and under-translation (adequacy)
bad at translating long sentences
needs more data
black box


SLIDE 23

Advances and Challenges

attention mechanism
model architectures
monolingual data
models at different levels
linguistic features
modelling coverage
domain adaptation
transfer learning
what is NMT not good at?


SLIDE 32

Attention Mechanism and Alignment

From "What does Attention in Neural Machine Translation Pay Attention to?"

SLIDE 33

Attention Mechanism and Alignment

POS tag   % attention to alignment points   % attention to other words
NUM       73                                27
NOUN      68                                32
ADJ       66                                34
PUNC      55                                45
ADV       50                                50
CONJ      50                                50
VERB      49                                51
ADP       47                                53
DET       45                                55
PRON      45                                55
PRT       36                                64
Overall   54                                46

Table 8 – Distribution of attention probability mass (in %) over alignment points and the rest of the words for each POS tag.

From "What does Attention in Neural Machine Translation Pay Attention to?"

SLIDE 34

Attention Mechanism and Alignment

Figure – Attention weight matrices: (a) Desired Alignment, for "relations between Obama and Netanyahu have been strained for years." ↔ "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt."; (b) Mismatched Alignment, for "the relationship between Obama and Netanyahu has been stretched for years." ↔ "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt."

SLIDE 35

Attention and Unknown Words

1. Find the corresponding source word
2. Look up its translation in a dictionary

Example from Zhaopeng Tu's tutorial; method from "On Using Very Large Target Vocabulary for Neural Machine Translation"

SLIDE 37

Attention Mechanisms

Figure – (a) Vanilla attention mechanism: encoder hidden states h and decoder state s_{t-1} feed a single attention network producing weights α_t and context c_t. (b) Advanced attention mechanism: multi-head attention inside a stack of n decoder blocks, each attending over the encoder hidden states

SLIDE 38

Transformer Attention and Vanilla Attention

Figure – Attention weight matrices for the sentence pair "wer garantiert den Leuten eine Stelle ?" ↔ "who guarantees people a job ? </s>": (a)–(h) Transformer attention in layers 1–8; (i) vanilla RNN attention

SLIDE 39

Transformer Attention and Vanilla Attention

Figure – Average attention weights in different Transformer attention layers (L1–L8), for ambiguous nouns vs. all tokens

SLIDE 40

Fine-grained Attention Mechanism

Figure 1 – (a) The conventional attention mechanism and (b) the proposed fine-grained attention mechanism. Note that \sum_t \alpha_{t',t} = 1 in the conventional method, and \sum_t \alpha^d_{t',t} = 1 for every dimension d in the proposed method.

From "Fine-Grained Attention Mechanism for Neural Machine Translation"
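A small sketch of the per-dimension normalisation described in the caption (an assumed reading, for illustration): the softmax over source positions is applied separately for every dimension of the context vector:

```python
import numpy as np

rng = np.random.default_rng(3)
Tx, dim = 5, 4                                  # toy sizes
e = rng.normal(0, 1, (dim, Tx))                 # a score per (dimension, position)

alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)       # softmax along positions, per dim

h = rng.normal(0, 1, (Tx, dim))                 # encoder annotations
c = (alpha * h.T).sum(axis=1)                   # dimension-wise weighted context
print(alpha.sum(axis=1))                        # sum_t alpha[d, t] = 1 for each d
```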

SLIDE 41

Model Architectures

Neural networks:
vanilla RNN
LSTM/GRU
CNN
self-attention

Figure – Connectivity patterns over an input x_1 ... x_5: (a) RNN; (b) CNN (with padding); (c) self-attention

SLIDE 42

Model Architectures

Encoder-decoders:
with residual feed-forward layers
cascaded encoder
multi-column encoder

SLIDE 43

Model Architectures

Encoder-decoders:
with residual feed-forward layers
cascaded encoder
multi-column encoder

Figure – (a) Cascaded Encoder; (b) Multi-Column Encoder

SLIDE 44

Character-level Models

MT is an open-vocabulary problem:

compounding and other productive morphological processes: "they charge a carry-on bag fee." → "sie erheben eine Hand|gepäck|gebühr."
names: Obama (English, German), Обама (Russian), オバマ (o-ba-ma) (Japanese)
technical terms, numbers, etc.

But a limited vocabulary causes out-of-vocabulary words.


SLIDE 46

Character-level Models

Character-level models:
small vocabulary
do not require segmentation/tokenization
can model different, rare morphological variants of a word
but sentences become longer (harder for training)
and training time is longer

SLIDE 47

Character-level Models

Figure – C2W compositional model: a BLSTM over the characters W-h-e-r-e produces the word vector for "Where"

From "Character-based Neural Machine Translation"

SLIDE 48

Character-level Models

Figure – V2C generation model: a forward LSTM generates the characters of the target word (e.g. e-s-t-a) between SOW and EOW markers

From "Character-based Neural Machine Translation"

SLIDE 49

Hybrid-level Models

From "Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models"

SLIDE 50

Subword-level Models

Byte pair encoding algorithm: frequent character n-grams (or whole words) are merged into a single symbol.
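The merge-learning loop, essentially following the pseudocode published with the BPE paper (Sennrich et al., 2016), on that paper's toy vocabulary:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary; </w> marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)   # first merges: ('e','s'), ('es','t'), ('est','</w>'), ...
```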


SLIDE 56

Subword-level Models

Byte pair encoding algorithm:
learned merge symbols can be applied to unknown words
trade-off between text length and vocabulary size



SLIDE 63

Monolingual Data

Use target-side monolingual data to enhance models.

Dummy source:
no source sentence
randomly sample from monolingual data each epoch
freeze encoder/attention layers for monolingual training instances

Synthetic source:
produce synthetic source-side sentences via back-translation
back-translation: use a model trained in the opposite direction to generate source-side sentences
randomly sample from back-translated data
synthetic and real parallel data are not distinguished (see the sketch below)
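A hedged sketch of the back-translation pipeline; `reverse_model` is a hypothetical stand-in for a target-to-source system trained beforehand, and only the mixing logic is the point:

```python
import random

def make_training_data(parallel, target_mono, reverse_model, sample_size):
    """Mix real parallel data with synthetic pairs from back-translation."""
    sampled = random.sample(target_mono, sample_size)
    # Hypothetical API: translate each monolingual target sentence back
    # into the source language to obtain a synthetic source side.
    synthetic = [(reverse_model.translate(t), t) for t in sampled]
    data = parallel + synthetic       # synthetic and real pairs are...
    random.shuffle(data)              # ...not distinguished during training
    return data
```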


SLIDE 65

Linguistic Features

morphological features
part-of-speech tags
syntactic dependency labels

Feature embeddings are concatenated to the word embeddings:

\overrightarrow{h}_j = \tanh\left(\overrightarrow{W} \left( \big\Vert_{k=1}^{|F|} E_k x_{jk} \right) + \overrightarrow{U} \overrightarrow{h}_{j-1}\right)

For example, with a 3-dimensional word embedding and a 1-dimensional POS embedding:

E_1(\text{close}) = (0.4, 0.1, 0.2)^\top, \quad E_2(\text{adj}) = (0.1), \quad E_1(\text{close}) \Vert E_2(\text{adj}) = (0.4, 0.1, 0.2, 0.1)^\top

From "Linguistic Input Features Improve Neural Machine Translation"
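The toy numbers from the slide as a short NumPy snippet, showing the concatenation that forms the encoder input:

```python
import numpy as np

E1_close = np.array([0.4, 0.1, 0.2])       # word embedding for "close"
E2_adj = np.array([0.1])                   # POS-tag embedding for "adj"

x_j = np.concatenate([E1_close, E2_adj])   # encoder input for this position
print(x_j)                                 # [0.4 0.1 0.2 0.1]
```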

SLIDE 66

Modelling Coverage

Problems:
over-translation: words unnecessarily translated multiple times
under-translation: words mistakenly left untranslated

Coverage model:
a coverage vector guides the attention network
pay more attention to the untranslated words (see the sketch below)

Context gate:
content words rely more on the source context
function words rely more on the target context
control the ratio of source and target context
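A minimal sketch of one coverage variant (an assumption for illustration: coverage as the running sum of past attention weights, subtracted from the scores so attention drifts toward untranslated positions):

```python
import numpy as np

def attend_with_coverage(e, coverage, penalty=1.0):
    """Penalise already-covered source positions before the softmax."""
    scores = e - penalty * coverage
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha, coverage + alpha         # coverage accumulates attention

e = np.array([2.0, 1.0, 0.5])              # fixed toy attention scores
coverage = np.zeros(3)
for step in range(3):
    alpha, coverage = attend_with_coverage(e, coverage)
    print(step, alpha.round(2))            # mass shifts to less-covered words
```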


SLIDE 69

Domain Adaptation with Continued Training

SGD is sensitive to the order of training instances; best practice:

first train on all available data
continue training on in-domain data

Large BLEU improvements reported with minutes of training time [Sennrich et al., 2016, Luong and Manning, 2015, Crego et al., 2016]

Fine-tuning in IWSLT (en-de), BLEU:

            tst2013   tst2014   tst2015
Baseline    26.5      23.5      25.5
Fine-tuned  30.4      25.9      28.4

Generic system (≈ 8M sentences), fine-tuned with TED (≈ 200k)

Example from Rico Sennrich's EACL 2017 NMT talk

SLIDE 70

Transfer Learning

Use the hidden representations in NMT as pre-trained embeddings.

Figure 1 – Illustration of the approach, after (Belinkov et al., 2017): (i) an NMT system trained on ...

From "Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks"

SLIDE 71

Amount of Training Data

Figure – BLEU scores with varying amounts of training data (corpus size in English words, 10^6 to 10^8), comparing phrase-based with big LM, phrase-based, and neural MT. Neural MT scores very low with little data (1.6 BLEU at the smallest size) but overtakes both phrase-based systems at the largest sizes (31.1 vs. 30.4 and 28.6 BLEU).

SLIDE 72

Long Sentences

Figure – BLEU scores with varying sentence length (source subword count, 10–80), comparing neural and phrase-based MT. Neural MT is competitive on short and medium sentences but falls behind phrase-based on the longest sentences.

SLIDE 73

Further Topics ...

using discourse-level information
external alignments to guide the attention
combining NMT models with SMT models
multilingual translation
training criteria (beyond maximum likelihood estimation)
dual learning (less training data)
unsupervised learning (only monolingual data)