

slide-1
SLIDE 1

Advances and Challenges in Neural Machine Translation

Gongbo Tang

26 September 2019

slide-2
SLIDE 2

Outline

1. Model Architectures
2. Noisy Data
3. Monolingual Data
4. Domain Adaptation
5. Coverage
6. Understanding NMT


slide-3
SLIDE 3

The Best of Both Worlds

Encoder-decoders
With residual feed-forward layers
Cascaded encoder
Multi-column encoder

(a) Cascaded Encoder (b) Multi-Column Encoder

Source: The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation

slide-4
SLIDE 4

Star-Transformer


Figure 1: Left: Connections of one layer in Transformer, circle nodes indicate the hidden states of input tokens. Right: Connections of one layer in Star-Transformer, the square node is the virtual relay node. Red edges and blue edges are ring and radical connections, respectively.

Source: Star-Transformer

slide-5
SLIDE 5

Modeling Recurrence for Transformer

Figure 2: The architecture of Transformer augmented with an additional recurrence encoder, the output of which is directly fed to the top decoder layer.

Figure 3: Two implementations of recurrence modeling: (a) standard RNN, and (b) the proposed ARN.

Source: Modeling Recurrence for Transformer

slide-6
SLIDE 6

Convolutional Self-Attention Networks

(Figure panels over the example sentence "Bush held a talk with Sharon": (a) vanilla SANs, (b) 1D-convolutional SANs, (c) 2D-convolutional SANs.)

Figure 1: Illustration of (a) vanilla SANs; (b) 1-dimensional convolution with the window size being 3; and (c) 2-dimensional convolution with the area being 3 × 3. Different colors represent different subspaces modeled by multi-head attention, and transparent colors denote masked tokens that are invisible to SANs.

Source: Convolutional Self-Attention Networks

slide-7
SLIDE 7

Lattice-Based Transformer Encoder

(Figure: three segmentations of "mao-yi-fa-zhan-ju-fu-zong-cai", e.g. "mao-yi-fa-zhan | ju | fu-zong-cai", "mao-yi | fa-zhan-ju | fu-zong-cai", and "mao-yi | fa-zhan | ju | fu | zong-cai", combined into one lattice.)

Figure 1: Incorporating three different segmentations for a lattice graph. The original sentence is “mao-yi-fa-zhan-ju-fu-zong-cai”. In Chinese it is “贸易发展局 副总裁”. In English it means “The vice president of Trade Development Council”.

Figure 2: The architecture of the lattice-based Transformer encoder. Lattice positional encoding is added to the embeddings of lattice sequence inputs. Different colors in lattice-aware self-attention indicate different relation embeddings.

Source: Lattice-Based Transformer Encoder for Neural Machine Translation

slide-8
SLIDE 8

Incorporating Sentential Context


Figure 1: Illustration of the proposed approaches on a 3-layer encoder: (a) vanilla model without sentential context; (b) shallow sentential context representation (i.e., blue square) by exploiting the top encoder layer only; and (c) deep sentential context representation (i.e., brown square) by exploiting all encoder layers. The circles denote hidden states of individual tokens in the input sentence, and the squares denote the sentential context representations. The red up arrows denote that the representations are fed to the subsequent decoder. This figure is best viewed in color.

Source: Exploiting Sentential Context for Neural Machine Translation

slide-9
SLIDE 9

Tree Transformer

(Figure: constituent blocks over the example sentence "the cute dog is wagging its tail", across layers 0-2.)

Figure 1: (A) A 3-layer Tree Transformer, where the blocks are constituents induced from the input sentence. The two neighboring constituents may merge together in the next layer, so the sizes of constituents gradually grow from layer to layer. The red arrows indicate the self-attention. (B) The building blocks of Tree Transformer. (C) Constituent prior C for the layer 1.

Source: Tree Transformer: Integrating Tree Structures into Self-Attention

slide-10
SLIDE 10

Noise in Training Data

  • Crawled parallel data from the web (very noisy)

              SMT            NMT
WMT17         24.0           27.2
+ Paracrawl   25.2 (+1.2)    17.3 (-9.9)

(German-English, 90m words each of WMT17 and Crawl data)

  • Corpus cleaning methods [Xu and Koehn, EMNLP 2017] give improvements

Source: Philipp Koehn's slides

slide-11
SLIDE 11

Noisy Data

Types of noise
– Misaligned sentences
– Disfluent language (from MT, bad translations)
– Wrong language data (e.g., French in a German–English corpus)
– Untranslated sentences
– Short segments (e.g., dictionaries)
– Mismatched domain


slide-12
SLIDE 12

Mismatched Sentences

  • Artificially created by randomly shuffling sentence order
  • Added to existing parallel corpus in different amounts

Misaligned pairs added    5%           10%          20%          50%          100%
NMT                                                              26.1 (-1.1)  25.3 (-1.9)
SMT                       24.0 (-0.0)  24.0 (-0.0)  23.9 (-0.1)  23.9 (-0.1)  23.4 (-0.6)
  • Bigger impact on NMT than on SMT

Source: Philipp Koehn's slides

slide-13
SLIDE 13

Misordered Words

  • Artificially created by randomly shuffling words in each sentence

Shuffled words         5%           10%          20%          50%          100%
Source  NMT                                                   26.6 (-0.6)  25.5 (-1.7)
        SMT            24.0 (-0.0)  23.6 (-0.4)  23.9 (-0.1)  23.6 (-0.4)  23.7 (-0.3)
Target  NMT                                                   26.7 (-0.5)  26.1 (-1.1)
        SMT            24.0 (-0.0)  24.0 (-0.0)  23.4 (-0.6)  23.2 (-0.8)  22.9 (-1.1)
  • Similar impact on NMT and SMT; source-side reshuffling is worse

Source: Philipp Koehn's slides

slide-14
SLIDE 14

Untranslated Sentences

Untranslated sentences   5%           10%           20%          50%          100%
Source  NMT              17.6 (-9.8)  11.2 (-16.0)   5.6 (-21.6)  3.2 (-24.0)  3.2 (-24.0)
        SMT              23.8 (-0.2)  23.9 (-0.1)   23.8 (-0.2)  23.4 (-0.6)  21.1 (-2.9)
Target  NMT              27.2 (-0.0)  27.0 (-0.2)   26.7 (-0.5)  26.8 (-0.4)  26.9 (-0.3)

Source: Philipp Koehn's slides

slide-15
SLIDE 15

Short Sentences

Short sentences added   5%           10%          20%          50%
1-2 words  NMT          27.1 (-0.1)  26.5 (-0.7)  26.7 (-0.5)
           SMT          24.1 (+0.1)  23.9 (-0.1)  23.8 (-0.2)
1-5 words  NMT          27.8 (+0.6)  27.6 (+0.4)  28.0 (+0.8)  26.6 (-0.6)
           SMT          24.2 (+0.2)  24.5 (+0.5)  24.5 (+0.5)  24.2 (+0.2)
  • No harm done

Source: Philipp Koehn's slides

slide-16
SLIDE 16

Amount of Training Data

(Chart: BLEU scores with varying amounts of training data; x-axis: corpus size in English words, 10^6 to 10^8. Phrase-based with big LM: 21.8 up to 30.4; phrase-based: 16.4 up to 28.6; neural: 1.6 up to 31.1. NMT is far behind with little data but overtakes both phrase-based systems at the largest data sizes.)


slide-17
SLIDE 17

Using Monolingual Data in NMT

Dummy source
– No source sentence
– Randomly sample from monolingual data each epoch
– Freeze encoder/attention layers for monolingual training instances
Synthetic source
– Produce a synthetic source-side sentence via back-translation
– Back-translation: use a model trained in the opposite direction to generate the source-side sentence


slide-18
SLIDE 18

Back Translation

Steps
– Train a system in the reverse language direction
– Use that system to translate target-side monolingual data
– Combine the real parallel data with the synthetic parallel data
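A minimal sketch of this pipeline, assuming a hypothetical reverse_model.translate function that maps a target-language sentence back into the source language:

# Minimal back-translation data pipeline (illustrative; reverse_model.translate
# is a hypothetical stand-in for any target-to-source translation function).

def build_backtranslated_corpus(reverse_model, mono_target, real_pairs):
    """Create synthetic (source, target) pairs from target-side monolingual data."""
    synthetic_pairs = []
    for tgt in mono_target:
        src = reverse_model.translate(tgt)   # translate target -> source
        synthetic_pairs.append((src, tgt))   # synthetic source, real target
    # the final source -> target system is then trained on real + synthetic data
    return real_pairs + synthetic_pairs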

(Figure: a reverse system produces synthetic data for training the final system.)

Figure from: Philipp Koehn's slides

slide-19
SLIDE 19

Iterative Back Translation

(Figure: iterative back-translation alternating between back system 1, back system 2, and the final system.)

Figure from: Philipp Koehn's slides

slide-20
SLIDE 20

Dual Learning

  • We could iterate through steps of

– train system
– create synthetic corpus

  • Dual learning: train models in both directions together

– translation models F → E and E → F
– take sentence f
– translate it into sentence e'
– translate that back into sentence f'
– training objective: f should match f'

  • Setup could be fooled by just copying (e’ = f)

⇒ score e' with a language model for language E; add the language model score as a cost to the training objective
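A rough sketch of one such round trip; mt_fe, mt_ef, and lm_e are hypothetical interfaces standing in for the two translation models and the language model:

# One dual-learning round trip (all model interfaces are hypothetical stand-ins).

def dual_learning_reward(mt_fe, mt_ef, lm_e, f, alpha=0.5):
    """Reward for sentence f combining fluency of the intermediate translation
    and the round-trip reconstruction likelihood."""
    e_prime = mt_fe.sample_translation(f)              # F -> E
    lm_reward = lm_e.log_prob(e_prime)                 # is e' fluent in language E?
    recon_reward = mt_ef.log_prob(f, given=e_prime)    # can we recover f from e'?
    # the LM term discourages the trivial copy e' = f
    return alpha * lm_reward + (1 - alpha) * recon_reward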

Source: Philipp Koehn's slides

slide-21
SLIDE 21

Dual Learning

(Figure: translation models MT F→E and MT E→F coupled with language models LM E and LM F over sentences e and f.)

Figure from: Philipp Koehn's slides

slide-22
SLIDE 22

Domain Adaptation

  • Better quality when system is adapted to a task
  • Domain adaptation to a specific domain, e.g., information technology
  • Some training data is more relevant than the rest
  • May also adapt to specific user (personalization)
  • May optimize for a specific document or sentence

Figure from: Philipp Koehn's slides

slide-23
SLIDE 23

Domains

Medical: Abilify is a medicine containing the active substance aripiprazole. It is available as 5 mg, 10 mg, 15 mg and 30 mg tablets, as 10 mg, 15 mg and 30 mg orodispersible tablets (tablets that dissolve in the mouth), as an oral solution (1 mg/ml) and as a solution for injection (7.5 mg/ml).
Software Localization: Default GNOME Theme OK People
Literature: There was a slight noise behind her and she turned just in time to seize a small boy by the slack of his roundabout and arrest his flight.
Law: Corrigendum to the Interim Agreement with a view to an Economic Partnership Agreement between the European Community and its Member States, of the one part, and the Central Africa Party, of the other part.
Religion: This is The Book free of doubt and involution, a guidance for those who preserve themselves from evil and follow the straight path.
News: The Facebook page of a leading Iranian cartoonist, Mana Nayestani, was hacked on Tuesday, 11 September 2012, by pro-regime hackers who call themselves "Soldiers of Islam".
Movie subtitles: We're taking you to Washington, D.C. Do you know where the prisoner was transported to? Uh, Washington. Okay.
Twitter: Thank u @Starbucks & @Spotify for celebrating artists who #GiveGood with a donation to @BTWFoundation, and to great organizations by @Metallica and @ChanceTheRapper! Limited edition cards available now at Starbucks!
Figure from: Philipp Koehn's slides

slide-24
SLIDE 24

Domain Differences

Topic: The subject matter of the text, such as politics or sports.
Modality: How was this text originally created? Is this written text or transcribed speech, and if speech, is it a formal presentation or an informal dialogue full of incomplete and ungrammatical sentences?
Register: Level of politeness. In some languages this is very explicit, such as the use of the informal Du or the formal Sie for the personal pronoun you in German.
Intent: Is the text a statement of fact, an attempt to persuade, or communication between multiple parties?
Style: Is it a terse informal text, or is it full of emotional and flowery language?

Figure from: Philipp Koehn's slides

slide-25
SLIDE 25

Domain Adaptation

Source: Schaue um dich herum.          Reference: Look around you.
All data   NMT: Look around you.                                        SMT: Look around you.
Law        NMT: Sughum gravecorn.                                       SMT: In order to implement dich Schaue .
Medical    NMT: EMEA / MB / 049 / 01-EN-Final Work progamme for 2002    SMT: Schaue by dich around .
IT         NMT: Switches to paused.                                     SMT: To Schaue by itself .
Koran      NMT: Take heed of your own souls.                            SMT: And you see.
Subtitles  NMT: Look around you.                                        SMT: Look around you .

Figure from: Philipp Koehn's slides

slide-26
SLIDE 26

Domain Adaptation

System ↓      Law        Medical    IT         Koran      Subtitles
All Data      30.5/32.8  45.1/42.2  35.3/44.7  17.9/17.9  26.4/20.8
Law           31.1/34.4  12.1/18.2   3.5/ 6.9   1.3/ 2.2   2.8/ 6.0
Medical        3.9/10.2  39.4/43.5   2.0/ 8.5   0.6/ 2.0   1.4/ 5.8
IT             1.9/ 3.7   6.5/ 5.3  42.1/39.8   1.8/ 1.6   3.9/ 4.7
Koran          0.4/ 1.8   0.0/ 2.1   0.0/ 2.3  15.9/18.8   1.0/ 5.5
Subtitles      7.0/ 9.9   9.3/17.8   9.2/13.6   9.0/ 8.4  25.9/22.1
(each cell: NMT BLEU / SMT BLEU; rows are training corpora, columns are test domains)

Figure from: Philipp Koehn's slides

slide-27
SLIDE 27

Data Combination

Combined Domain Model

Figure from: Philipp Koehn's slides

slide-28
SLIDE 28

Data Combination

(Figure: out-of-domain data and in-domain data combined into a single model.)

Oversample in-domain data

Figure from: Philipp Koehn's slides

slide-29
SLIDE 29

Model Combination

(Figure: an in-domain model and an out-of-domain model combined.)

Figure from: Philipp Koehn's slides

slide-30
SLIDE 30

Topic Models

  • Cluster corpus by topic — Latent Dirichlet Allocation (LDA)
  • Train separate sub-models for each topic
  • For input sentence, detect topic (or topic distribution)

Figure from: Philipp Koehn's slides

slide-31
SLIDE 31

Data Sampling

Combined Domain Model

  • Select out-of-domain sentence pairs that are similar to in-domain data

Figure from: Philipp Koehn's slides

slide-32
SLIDE 32

Data Sampling


Moore Lewis

(Figure: each sentence is scored by an in-domain language model and an out-of-domain language model.)

  • Build language models

– out of domain
– in domain

  • Score each sentence
  • Sub-select sentence pairs with

pIN(f) − pOUT(f) > τ
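A small sketch of this selection rule, assuming hypothetical lm_in.log_prob and lm_out.log_prob scorers (e.g., length-normalized sentence log-probabilities from the two language models):

# Moore-Lewis style selection: keep out-of-domain sentences whose in-domain LM
# score exceeds their out-of-domain LM score by a margin tau.

def select_sentences(sentences, lm_in, lm_out, tau=0.0):
    selected = []
    for f in sentences:
        score = lm_in.log_prob(f) - lm_out.log_prob(f)   # p_IN(f) - p_OUT(f)
        if score > tau:
            selected.append((score, f))
    # most in-domain-like sentences first
    return [f for _, f in sorted(selected, key=lambda x: x[0], reverse=True)]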

Figure from: Philipp Koehn's slides

slide-33
SLIDE 33

Data Sampling

Modified Moore Lewis

(Figure: sentences are scored by in-domain and out-of-domain language models on both the source side and the target side.)

  • 2 sets of language models

– source language
– target language

  • Add scores

Figure from: Philipp Koehn's slides

slide-34
SLIDE 34

Fine Tuning

(Figure: an out-of-domain model further trained into an in-domain model.)

  • First train system on out-of-domain data (or: all available data)
  • Stop at convergence
  • Then, continue training on in-domain data
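A minimal fine-tuning loop as a sketch; train_epoch and evaluate are hypothetical helpers, and the small learning rate plus early stopping are common but optional choices:

# Fine-tuning sketch: continue training a converged out-of-domain model on
# in-domain data, typically with a small learning rate and early stopping.

def fine_tune(model, in_domain_data, dev_data, lr=1e-5, max_epochs=10, patience=2):
    best_bleu, bad_epochs = evaluate(model, dev_data), 0
    for _ in range(max_epochs):
        train_epoch(model, in_domain_data, learning_rate=lr)
        bleu = evaluate(model, dev_data)
        if bleu > best_bleu:
            best_bleu, bad_epochs = bleu, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break   # stop before the model drifts too far from the general domain
    return model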

Figure from: Philipp Koehn's slides

slide-35
SLIDE 35

Curriculum Training

  • Recall: relevance score for each sentence pair
  • Training epochs

– start with all data (100%)
– train only on somewhat relevant data (50%)
– train only on relevant data (25%)
– train only on very relevant data (10%)
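One possible schedule as a sketch, assuming each sentence pair already carries a relevance score (e.g., from Moore-Lewis) and a hypothetical train_epoch helper:

# Curriculum sketch: later stages train only on the most relevant sentence pairs.
# scored_pairs is a list of (relevance_score, sentence_pair).

def curriculum_training(model, scored_pairs, schedule=(1.00, 0.50, 0.25, 0.10)):
    ranked = sorted(scored_pairs, key=lambda x: x[0], reverse=True)  # most relevant first
    for fraction in schedule:
        subset = [pair for _, pair in ranked[: int(len(ranked) * fraction)]]
        train_epoch(model, subset)   # one (or more) epochs per curriculum stage
    return model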

Figure from: Philipp Koehn's slides

slide-36
SLIDE 36

Adequacy and Fluency

(from: Sennrich and Haddow, 2017)

From Philipp Koehn's slides

slide-37
SLIDE 37

Over-generation and Under-translation

Source: in order to solve the problem , the " Social Housing " alliance suggests a fresh start .
Output: um das Problem zu lösen , schlägt das Unternehmen der Gesellschaft für soziale Bildung vor .

From Philipp Koehn's slides

slide-38
SLIDE 38

Modeling Coverage

  • Track coverage during decoding

coverage(j) = Σ_i α_ij

over-generation = Σ_j max(0, coverage(j) − 1)

under-generation = Σ_j min(1, coverage(j))

  • Add additional penalty functions to score hypotheses
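An illustrative way to compute these quantities from a hypothesis' attention matrix (rows: output words, columns: input words), using numpy; how the two terms are weighted when rescoring is a tuning choice:

import numpy as np

# Coverage-based quantities computed from a finished hypothesis' attention
# matrix alpha of shape (output_length, input_length).

def coverage_penalties(alpha):
    coverage = alpha.sum(axis=0)                          # coverage(j) = sum_i alpha_ij
    overgeneration = np.maximum(0.0, coverage - 1.0).sum()
    undergeneration = np.minimum(1.0, coverage).sum()
    return overgeneration, undergeneration

# One possible rescoring term (weights w1, w2 are tuning choices):
#   score = log_prob - w1 * overgeneration + w2 * undergeneration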

From Philipp Koehn's slides

slide-39
SLIDE 39

Modeling Coverage

  • Extend translation model
  • Use vector that accumulates coverage of input words to inform attention

– raw attention score a(s_i−1, h_j)
– informed by previous decoder state s_i−1 and input word h_j
– add conditioning on coverage(j):
  a(s_i−1, h_j) = W_a s_i−1 + U_a h_j + V_a coverage(j) + b_a

  • Coverage tracking may also be integrated into the training objective.

Σ_i log P(y_i | x) + λ Σ_j (1 − coverage(j))²
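A literal sketch of the coverage-conditioned attention energy above, with parameter shapes chosen so that each energy is a scalar (W_a, U_a are vectors, V_a and b_a are scalars); real implementations typically add a non-linearity and extra projections:

import numpy as np

def attention_weights(s_prev, H, coverage, W_a, U_a, V_a, b_a):
    """s_prev: decoder state (d,); H: encoder states (n, d); coverage: (n,)."""
    energies = np.array([
        W_a @ s_prev + U_a @ h_j + V_a * c_j + b_a    # a(s_{i-1}, h_j)
        for h_j, c_j in zip(H, coverage)
    ])
    weights = np.exp(energies - energies.max())       # softmax over input positions
    return weights / weights.sum()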

From Philipp Koehn's slides

slide-40
SLIDE 40

Modeling Coverage

(a) Context Gate (source) (b) Context Gate (target) (c) Context Gate (both) Figure 4: Architectures of NMT with various context gates, which either scale only one side of translation contexts (i.e., source context in (a) and target context in (b)) or control the effects of both sides (i.e., (c)).

Figure from: Context Gates for Neural Machine Translation

slide-41
SLIDE 41

Coverage-aware Method

c(x, y) = Σ_{i=1..|x|} log max(Σ_{j=1..|y|} a_ij, β)

Example: Σ_j a_1j = 0.7, max(0.7, β) = 0.8; Σ_j a_2j = 1.2, max(1.2, β) = 1.2;
c(x, y) = log 0.8 + log 1.2 + ... = 1.5

Figure 1: The coverage score for a running example (Chinese pinyin-English and β = 0.8).

Figure from: A Simple and Effective Approach to Coverage-Aware Neural Machine Translation

slide-42
SLIDE 42

What is in a representation?

What is contained in an intermediate representation?
– word embedding
– encoder state
– decoder state
More specific questions
– Does the model discover morphological properties?
– Does the model disambiguate words?



slide-44
SLIDE 44

Probing Classifier

  • Pose a hypothesis, e.g.,

Encoder states discover part-of-speech.

  • Formalize this as a classification problem

– given: encoder state for the word dog
– label: singular noun (NN)

  • Train on representations generated by running inference

– translate sentences not seen during training
– record their encoder states
– look up their part-of-speech tags (by running a POS tagger, or from labeled data)
→ training example (encoder state ; label)

  • Test on new sentences
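A minimal probing setup as a sketch, assuming the encoder states and POS labels have already been collected as arrays; the linear classifier here is one common choice:

from sklearn.linear_model import LogisticRegression

# Probing sketch: a linear classifier from frozen encoder states to POS tags.
# train_states/test_states: (num_tokens, hidden_dim) arrays of encoder states;
# train_tags/test_tags: the corresponding POS labels.

def probe_pos(train_states, train_tags, test_states, test_tags):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_states, train_tags)            # train on (state, label) pairs
    return clf.score(test_states, test_tags)     # accuracy on unseen sentences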

From Philipp Koehn's slides

slide-45
SLIDE 45

Probing Classifier

Figure 1: Illustration of our approach, after (Belinkov et al., 2017).

Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks

slide-46
SLIDE 46

Source Syntax

  • LSTM sequence-to-sequence model without attention
  • Different tasks

– translate English into Russian, German
– copy English
– copy permuted English
– parse English into a linearized parse structure

  • Predict

– constituent phrase (NP, VP, etc.)
– passive voice and tense

  • Findings

– much better quality when translating than majority class
– same quality for copying as majority class

Does String-Based Neural MT Learn Source Syntax?

slide-47
SLIDE 47

POS Tagging and Semantic Tagging

  • Attentional neural machine translation model
  • Predict

– part-of-speech tag
– semantic tag
  ∗ type of named entity
  ∗ semantic relationships
  ∗ discourse relationships

  • Findings

– compare prediction quality of different encoder layers
– mostly better performance at deeper layers
– little impact from target language

Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks

slide-48
SLIDE 48

Morphology

  • Attentional neural machine translation model with character-based word embeddings

  • Predict for morphologically rich input languages

– part-of-speech tag
– morphological properties

  • Findings

– character-based representations much better for learning morphology
– word-based models are sufficient for learning structure of common words
– lower layers better at word structure, deeper layers better at word meaning
– target language matters for what information is learned
– neural decoder learns very little about word structure

What do Neural Machine Translation Models Learn about Morphology?

slide-49
SLIDE 49

Word Sense Disambiguation

(Figure: word embeddings or hidden representations of tokens (ambiguous noun, translation, general word) are concatenated and fed to a feed-forward classifier that predicts correct or incorrect.)

            DE→EN              DE→FR
            RNN.    Trans.     RNN.    Trans.
BLEU        29.1    32.6       17.0    19.3
Embedding   63.1    63.2       68.7    68.9
ENC         94.2    97.2       91.7    95.6
DEC         97.9    91.2       95.1    91.6

Table 2: BLEU scores of NMT models, and WSD accuracy on the test set using word embeddings or hidden states to represent ambiguous nouns. The hidden states are from the highest layer. RNN. and Trans. denote RNNS2S and Transformer models, respectively.

Encoders Help You Disambiguate Word Senses in Neural Machine Translation

slide-50
SLIDE 50

Transfer Learning

Use the hidden representations in NMT as pre-trained embeddings

Figure 1: Illustration of our approach, after (Belinkov et al., 2017).

Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks

slide-51
SLIDE 51

Explainable AI

Important question for users
– Why did the network reach this decision?
Solutions
– Tracking decisions back to the inputs


slide-52
SLIDE 52

What Determined Output Decision ?

What part of the network had the biggest impact on the final decision?

Prediction of a specific output word:
– which of the input words mattered most?
– which of the previous output words mattered most?


slide-53
SLIDE 53

Layer-Wise Relevance Propagation

  • Start with output prediction

i.e., high value for word in softmax

  • Compute backwards what contributed to this high value
  • First step

– consider values of the previous layer
– consider weights from the previous layer
– assign a relevance value to each node in the previous layer
– normalize so they add up to one

  • Recurse until input layer is reached
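A sketch of one such backward step through a linear layer z = W x (epsilon rule); layer types and rules vary across LRP variants, so this is only illustrative:

import numpy as np

# One relevance-propagation step through a linear layer z = W x (epsilon rule):
# relevance R_out on the outputs is redistributed to the inputs in proportion
# to each input's contribution to each output.

def lrp_linear(x, W, R_out, eps=1e-6):
    """x: inputs (m,); W: weights (n, m); R_out: relevance of the n outputs."""
    contributions = W * x                    # contribution of input j to output k
    z = contributions.sum(axis=1) + eps      # stabilized pre-activations per output
    R_in = (contributions / z[:, None]).T @ R_out
    return R_in                              # relevance per input (approximately conserved)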

Source: Philipp Koehn's slides

slide-54
SLIDE 54

Layer-Wise Relevance Propagation

Example: Chinese–English

Source: Philipp Koehn's slides. Reference: Visualizing and Understanding Neural Machine Translation

slide-55
SLIDE 55

Identifying Neurons

Questions
– How are specific properties encoded? (Easiest case: in a single neuron)
– How do we find it?
Example: length of sequence
– given: encoder-decoder model without attention
– does the encoder record the length of the consumed sequence?
– does the decoder record the length of the generated sequence?


slide-56
SLIDE 56

Correlation

Steps
– select a neuron
– compute the correlation between the value of the neuron when processing the x-th word and the position x
– success if a highly correlated neuron is found
Example: length of sequence
– given: encoder-decoder model without attention
– does the encoder record the length of the consumed sequence?
– does the decoder record the length of the generated sequence?
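A small sketch of this correlation search, assuming activations have been collected as one (sentence length x hidden size) array per sentence:

import numpy as np

# Find the hidden unit whose activation correlates most strongly with the word
# position. activations: a list of (sentence_length, hidden_dim) arrays.

def most_length_correlated_unit(activations):
    values = np.concatenate(activations, axis=0)                  # all time steps
    positions = np.concatenate([np.arange(len(a)) for a in activations])
    corrs = np.array([np.corrcoef(values[:, k], positions)[0, 1]
                      for k in range(values.shape[1])])
    best = int(np.nanargmax(np.abs(corrs)))
    return best, corrs[best]          # unit index and its correlation with position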


slide-57
SLIDE 57

Neurons Correlated with Length


Figure 4: Action of translation unit109 and unit334 during the encoding and decoding of a sample sentence. Also shown is the softmax log-prob of output <EOS>.

Figure from: Why Neural Translations are the Right Length

slide-58
SLIDE 58

More Advances and Challenges

More advances and challenges
– multi-task learning
– document-level translation
– low-resource languages
– unsupervised NMT
– automatic post-editing
– quality estimation
– test suites for MT evaluation
– robustness
– parallel decoding
– speech translation
– ...
