

SLIDE 1

Advanced Neural Machine Translation

Gongbo Tang

23 September 2019

SLIDE 2

Outline

1. NMT with Attention Mechanisms
   - Attention Mechanisms
   - Understanding Attention Mechanisms
   - Attention Variants

2. NMT at Different Granularities
   - Hybrid Models
   - Character-level Models
   - Subword-level Models

SLIDE 3

Encoder-Decoder Architecture

Figure 1.6: Encoder-decoder architecture – example of the general approach for NMT. An encoder converts a source sentence into a meaning vector which is passed through a decoder to produce a translation.


SLIDE 4

Encoder-Decoder with Attention

Diagram: input words → encoder states → attention → context vector → decoder hidden state → output words.

SLIDE 5

Attentional NMT

Example from Rico Sennrich's EACL 2017 NMT talk.


SLIDE 11

Attention Mechanisms

Diagram: the encoder hidden states h and the previous decoder state s_{i-1} feed the attention module, which produces weights \alpha_i and a context vector c_i; the decoder network and a softmax then produce the predictions.

Computation

e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
\alpha_{ij} = softmax(e_{ij})
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
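A minimal NumPy sketch of the additive attention computation above; the matrix names follow the equations, but the exact shapes and the lack of batching are illustrative assumptions.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """s_prev: (d_dec,) previous decoder state s_{i-1}; H: (T_x, d_enc) encoder states h_j."""
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a    # (T_x,)
    alpha = softmax(e)                               # attention weights alpha_ij
    c = alpha @ H                                    # context vector c_i = sum_j alpha_ij h_j
    return alpha, c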


SLIDE 12

Attention Mechanisms

Methods to compute attention:

score(h_t, \bar{h}_s) = h_t^\top \bar{h}_s                     (dot)
score(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s                 (general)
score(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])   (concat)

Notation
Query (Q): the decoder hidden state
Keys (K): encoder hidden states
Values (V): encoder hidden states
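The three scoring functions can be sketched in NumPy as follows; the shapes of W_a and v_a (which differ between variants) are assumptions.

import numpy as np

def score(h_t, H_s, method, W_a=None, v_a=None):
    """h_t: (d,) decoder state (query); H_s: (T_x, d) encoder states (keys/values)."""
    if method == "dot":                                  # h_t^T h_s
        return H_s @ h_t
    if method == "general":                              # h_t^T W_a h_s, W_a: (d, d)
        return H_s @ (W_a @ h_t)
    if method == "concat":                               # v_a^T tanh(W_a [h_t; h_s]), W_a: (d_a, 2d)
        pairs = np.concatenate([np.tile(h_t, (len(H_s), 1)), H_s], axis=1)
        return np.tanh(pairs @ W_a.T) @ v_a
    raise ValueError(f"unknown method: {method}")

def attend(h_t, H_s, **kwargs):
    e = score(h_t, H_s, **kwargs)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                 # softmax over source positions
    return alpha, alpha @ H_s                            # weights and context vector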



SLIDE 14

Attention Mechanisms

Diagram (b), advanced attention mechanism: the encoder hidden states h and the previous state s_{t-1} feed multi-head attention, which provides weights \alpha_t and a context vector c_t to a stack of n decoder blocks; a softmax over the output of the final block produces the predictions.

SLIDE 15

Attention Mechanisms

Multi-Head Attention

Attention(Q, K, V) = softmax(Q K^\top / \sqrt{d_k}) V
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
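A minimal NumPy sketch of scaled dot-product and multi-head attention as defined above; the per-head projection matrices and their shapes are illustrative assumptions.

import numpy as np

def scaled_dot_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (n, m)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                   # softmax over the keys
    return A @ V                                          # (n, d_v)

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head projection matrices; W_o: output projection."""
    heads = [scaled_dot_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o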


SLIDE 16

CNN-based NMT models

Figure from Convolutional Sequence to Sequence Learning.

SLIDE 17

Transformer

Figure from Attention Is All You Need.

SLIDE 18

Understanding Attention Mechanisms

Motivation of attention mechanisms: the plain encoder-decoder NMT model has no explicit alignment model. Attention mechanisms try to mimic an alignment model and enable the model to learn to translate and align jointly.

Question: do attention mechanisms act as word alignment models in NMT?


SLIDE 20

Attention Visualization

Attention heatmaps for the sentence pair "wer garantiert den Leuten eine Stelle ?" / "who guarantees people a job ? </s>": panels (a)-(h) show attention at layers 1-8, panel (i) an RNN model.

SLIDE 21

Attention as Alignment

Table 6: AER scores of various models on the RWTH English-German alignment data.

Method              AER
global (location)   0.39
local-m (general)   0.34
local-p (general)   0.36
ensemble            0.34
Berkeley Aligner    0.32

Heatmaps: per-layer AER of Transformer attention (layers L1-L6) for two language pairs.

Table 1: AER of the proposed methods (results measured on TRANSFORMER-L6).

Method             ZH⇒EN   DE⇒EN
FAST ALIGN         36.57   26.58
Attention (mean)   56.44   74.59
Attention (best)   45.22   53.98
EAM                38.88   39.25
PD                 41.77   42.81

Findings: RNN attention is better than Transformer attention, and attention is much worse than a traditional alignment model.

Figures from Effective Approaches to Attention-based Neural Machine Translation and On the Word Alignment from Neural Machine Translation.

SLIDE 22

Attention is not Alignment

Figure from What does Attention in Neural Machine Translation Pay Attention to?

SLIDE 23

Attention is not Alignment

Example: the English sentence "relations between Obama and Netanyahu have been strained for years ." with (a) the desired alignment to "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ." and (b) a mismatched alignment for "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt ." ("the relationship between Obama and Netanyahu has been stretched for years .").

SLIDE 24

Attention Distribution

Table 8: Distribution of attention probability mass (in %) over alignment points and the rest of the words, for each POS tag.

POS tag    To alignment points (%)   To other words (%)
NUM        73                        27
NOUN       68                        32
ADJ        66                        34
PUNC       55                        45
ADV        50                        50
CONJ       50                        50
VERB       49                        51
ADP        47                        53
DET        45                        55
PRON       45                        55
PRT        36                        64
Overall    54                        46

Figure from What does Attention in Neural Machine Translation Pay Attention to?

SLIDE 25

Attention and Word Sense Disambiguation

Figure 5: WSD accuracy over attention ranges (x-axis: attention range from 0 to 1.0; y-axis: accuracy; curves for RNN and Transformer).

Figure from An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in NMT.

SLIDE 26

Attention and Anaphora

Figure 5: An example of an attention map between source and context. On the y-axis are the source tokens, on the x-axis the context tokens. Note the high attention between “it” and its antecedent “heart”.

Figure from Context-Aware Neural Machine Translation Learns Anaphora Resolution.

SLIDE 27

Attention and Unknown Words

1. Find the corresponding source word (via the attention weights).
2. Look up its translation in a dictionary.

Example from Zhaopeng Tu's tutorial; method from On Using Very Large Target Vocabulary for Neural Machine Translation.
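A hypothetical post-processing sketch of this idea: for every <unk> in the output, pick the source word with the highest attention weight and replace the <unk> with its dictionary translation (or copy the source word if it is not in the dictionary). Token and variable names are assumptions.

def replace_unks(target_tokens, source_tokens, attention, dictionary):
    """attention[i][j]: weight on source position j when target position i was produced."""
    output = []
    for i, token in enumerate(target_tokens):
        if token == "<unk>":
            j = max(range(len(source_tokens)), key=lambda k: attention[i][k])
            source_word = source_tokens[j]                            # 1. corresponding source word
            output.append(dictionary.get(source_word, source_word))   # 2. dictionary lookup
        else:
            output.append(token)
    return output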

SLIDE 28

Global Attention

Diagram: global alignment weights a_t are computed over all encoder states \bar{h}_s and combined into a context vector c_t in the attention layer; c_t and the decoder state h_t form \tilde{h}_t, from which y_t is predicted.

Figure from Effective Approaches to Attention-based Neural Machine Translation.

SLIDE 29

Local Attention

Diagram: the model first predicts an aligned position p_t, computes local alignment weights a_t over a window of encoder states \bar{h}_s around p_t, combines them into a context vector c_t, and forms \tilde{h}_t from c_t and h_t to predict y_t.

Figure from Effective Approaches to Attention-based Neural Machine Translation.
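A rough sketch of the local-p variant from the cited paper: predict an aligned position p_t from the decoder state, then favour source positions around p_t by weighting the usual attention with a Gaussian window (sigma = D/2). Parameter names, shapes, and the window size are assumptions.

import numpy as np

def local_p_attention(h_t, H_s, W_p, v_p, score_fn, D=10):
    """h_t: (d,) decoder state; H_s: (S, d) encoder states; score_fn as on slide 12."""
    S = len(H_s)
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))   # predicted aligned position
    positions = np.arange(S)
    window = np.exp(-((positions - p_t) ** 2) / (2 * (D / 2.0) ** 2))
    e = score_fn(h_t, H_s)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                     # global alignment weights
    alpha = alpha * window                                   # focus them on the local window
    return alpha, alpha @ H_s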

SLIDE 30

Sparse Attention

Attention heatmaps for two German-English examples ("this is moore 's law last hundred years ." and "now i am going to choose the government ."), illustrating the sparse attention distributions.

Figure from Sparse and Constrained Attention for Neural Machine Translation.
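The cited paper obtains sparse attention by replacing the softmax with sparsemax (and a constrained variant); the following NumPy sketch of plain sparsemax is an illustration, not the paper's exact formulation.

import numpy as np

def sparsemax(z):
    """Euclidean projection of the scores z onto the simplex; many entries become exactly 0."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1          # prefix of positions kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z      # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)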

SLIDE 31

Doubly Attention

Figure – The conventional attention distribution

Constraint: the attention weights over each source token should sum to 1.
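One common way to encourage this constraint is a doubly stochastic penalty added to the training loss, pushing the total attention received by every source token towards 1; this is an illustrative sketch, not necessarily the formulation used on the slide.

import numpy as np

def doubly_stochastic_penalty(alpha, lam=1.0):
    """alpha: (target_len, source_len) attention matrix whose rows already sum to 1."""
    per_source_mass = alpha.sum(axis=0)                # total attention on each source token
    return lam * np.sum((1.0 - per_source_mass) ** 2)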


SLIDE 32

Fine-grained Attention Mechanism

Figure 1: (a) The conventional attention mechanism and (b) the proposed fine-grained attention mechanism. Note that \sum_t \alpha_{t',t} = 1 in the conventional method, and \sum_t \alpha^d_{t',t} = 1 for every dimension d in the proposed method.

Figure from Fine-Grained Attention Mechanism for Neural Machine Translation.
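A minimal sketch contrasting conventional attention (one weight per source position) with the fine-grained variant (one weight per source position and per dimension); how the per-dimension scores are computed is left abstract and is an assumption.

import numpy as np

def conventional_context(scores, H):
    """scores: (T_x,); H: (T_x, d). One softmax over source positions."""
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ H                                       # (d,)

def fine_grained_context(scores, H):
    """scores: (T_x, d): a separate score for every dimension of every source state."""
    a = np.exp(scores - scores.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)                  # softmax over positions, per dimension
    return (a * H).sum(axis=0)                         # (d,)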

SLIDE 33

Attention via Attention

Figure from Attention-via-Attention Neural Machine Translation.

SLIDE 34

Limited Vocabulary

Expensive computation during prediction: the probabilities have to be computed over the entire target-side vocabulary at every step.

Diagram of a decoder step: context vectors c_{i-1}, c_i and states s_{i-1}, s_i feed the word predictions t_{i-1}, t_i; the selected word y_i (one of vocabulary items such as "the", "cat", "this", "of", "fish", "there", "dog", "these") is embedded as E y_i and fed into the next step.

Figure from Philipp Koehn's slides.

SLIDE 35

Limited Vocabulary

MT is an open-vocabulary problem:
- compounding and other productive morphological processes: "they charge a carry-on bag fee." → "sie erheben eine Hand|gepäck|gebühr."
- names: Obama (English; German), Обама (Russian), オバマ (o-ba-ma) (Japanese)
- technical terms, numbers, etc.

Problem: a limited vocabulary causes out-of-vocabulary words.


SLIDE 37

Hybrid-level (word+character) Models

Figure from Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.

SLIDE 38

Character-level Models

C2W compositional model: a BLSTM reads the characters W, h, e, r, e and composes the word vector for "Where".

Figure – Character to word compositional model

Figure from Character-based Neural Machine Translation.

SLIDE 39

Character-level Models

V2C generation model: a forward LSTM generates the characters of the target word one by one (SOW, e, s, t, a, EOW).

Figure – Vector to character generation model

Figure from Character-based Neural Machine Translation.

SLIDE 40

Character-level Models

Pros:
- small vocabulary
- no segmentation/tokenization required
- can model different, rare morphological variants of a word

Cons:
- sentences are longer (harder for training)
- training time is longer


SLIDE 42

Subword-level Models

Byte pair encoding algorithm: frequent character n-grams (or whole words) are merged into a single symbol.
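A compact sketch of the BPE learning loop, in the spirit of the published algorithm: count adjacent symbol pairs over a (word, frequency) vocabulary and repeatedly merge the most frequent pair. The toy word frequencies are illustrative.

import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # represent each word as a sequence of characters plus an end-of-word marker
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                       # most frequent adjacent pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), word): freq for word, freq in vocab.items()}
    return merges

print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10))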



SLIDE 48

Subword-level Models

Byte pair encoding algorithm:
- symbols can be applied to unknown words
- trade-off between text length and vocabulary size
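Applying the learned merges to a new, possibly unknown word could look like the sketch below: a simplified greedy application in merge order, shown only for illustration.

def apply_bpe(word, merges):
    symbols = list(word) + ["</w>"]
    for a, b in merges:                                   # apply merges in the learned order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]                # merge the pair into one symbol
            else:
                i += 1
    return symbols

# e.g. apply_bpe("lowest", merges) segments an unseen word into known subword symbols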



SLIDE 54

Subwords with Morphology

Linguistically Motivated Vocabulary Reduction: segmentation not only based on frequency, but on morphological knowledge.

D. Ataman et al., Linguistically Motivated Vocabulary Reduction for NMT (pp. 331-342).

Experiment 1: TED corpus, no-OOV case, vocabulary size 40K.

Method            BLEU    TER     CHRF3
No Segmentation   17.77   68.07   38.94
BPE               19.52   66.23   42.33
Supervised        21.61   61.76   44.01
LMVR              21.71   61.41   43.90

Input (Reference)                   Method       Segmentation                  Output
ağlarını (the nets)                 BPE          ağ@@ larını                   the cry
                                    LMVR         ağ +larını                    the nets
                                    Supervised   ağ +Noun +A3pl <EOW>          networks
ağlamayacak (would not be crying)   BPE          ağ@@ lamayacak                will not survive
                                    LMVR         ağlama +yacak                 will not cry
                                    Supervised   ağla +Neg +Fut +A3sg <EOW>    will not cry

Table 5. Results of Experiment 1 - TED corpus and no-OOV case. Top: output accuracies, where marked values indicate statistically significant improvement over the BPE baseline (p-value 0.05). Bottom: translation examples.

6. Results and Discussion

Table 5 shows the performance of different segmentation methods in Experiment 1. Our linguistically motivated vocabulary reduction (LMVR) method achieves the best performance on average, proving our hypothesis that a correct morphological representation generates more accurate translations. Our method outperforms the strong baseline of BPE-based segmentation by 2.2 BLEU, 4.8 TER and 1.6 CHRF3 points. The performance is slightly higher than the supervised method, which is related to the ambiguity caused by loss of information during the morphological analysis. The predicted vocabularies also indicate the significant difference between LMVR and BPE, where 73% of the sub-word units in the vocabulary are completely different. In order to better illustrate the properties of the generated sub-word units, we present example translations of two words from the test set. The two words have different roots; the first one is ağ (translation: net), and the second one is ağla (translation: (to) cry). BPE segments both words to the same root ağ, a character sequence frequently observed in root words in Turkish. In the first case, both unsupervised methods segment the word into the same sub-word units, while the embedding of the sub-word unit segmented with BPE is semantically ambiguous and generates unreliable translations. On the other hand, our method can preserve the correct meaning in both cases.

In Experiment 2, we evaluate our method at different rates of vocabulary reduction according to the vocabulary sizes given in Table 4. All metrics confirm that our method achieves better performance than the baseline in both experiments. In Experiment 2.a, at a vocabulary reduction rate of 4.25 (140K -> 40K), we obtain an improvement of 2.3 BLEU points over the baseline. In the most challenging case, Experiment


Figure from Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English.

SLIDE 55

Morphological Segmentation Tool

Morfessor

Figure 1: Screenshot from the Morfessor 2.0 demo.

Figure from Morfessor 2.0: Toolkit for statistical morphological segmentation.

SLIDE 56

Supplementary: Weight Tying/Sharing

t_i = softmax(W (U s_{i-1} + V E y_{i-1} + C c_i))    (13.76)

Diagram: the decoder state h_t (computed from c_t and y_{t-1}) is multiplied by the output matrix W (of size d_h x |V|) and passed through a softmax to produce y_t; the input embedding matrix E has size |V| x d, with d = d_h.

(a) Typical output layer, which is a softmax linear unit, without or with weight tying (W = E^T).

Figure from Beyond Weight Tying: Learning Joint Input-Output Embeddings for Neural Machine Translation.
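A minimal sketch of output-embedding weight tying as in the caption above: the output projection reuses the transposed input embedding matrix, W = E^T. The sizes and the initialisation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
V, d = 30000, 512                                  # vocabulary size and embedding/hidden size (d = d_h)
E = rng.normal(scale=0.01, size=(V, d))            # input embedding matrix E, shape |V| x d
W_untied = rng.normal(scale=0.01, size=(d, V))     # separate output matrix, shape d_h x |V|

def output_distribution(h, tie_weights=True):
    """h: (d,) decoder state; returns a softmax distribution over the vocabulary."""
    W = E.T if tie_weights else W_untied           # weight tying: W = E^T
    logits = h @ W
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()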

SLIDE 57

Attention

Abel GPU cluster account! The next lecture will be in: Sal XI, Universitetshuset!