SLIDE 1

Selective Attention for Context-aware Neural Machine Translation

Sameen Maruf†, André F. T. Martins‡, Gholamreza Haffari†

†Faculty of Information Technology, Monash University, Australia
‡Unbabel & Instituto de Telecomunicações, Lisbon, Portugal

NAACL-HLT, Minneapolis, June 2019


SLIDE 2

Overview

1. The Whys?
2. Proposed Approach
3. Experiments and Analyses
4. Summary

SLIDE 3

Overview (current section: The Whys?)

SLIDE 4

The Whys?

Why document-level machine translation?

• Most state-of-the-art NMT models translate sentences independently
• Discourse phenomena such as pronominal anaphora and coherence, which may involve long-range dependencies, are therefore ignored
• Most work on document NMT uses only a few previous sentences as context, ignoring the rest of the document [Jean et al., 2017, Wang et al., 2017, Bawden et al., 2018, Voita et al., 2018, Tu et al., 2018, Zhang et al., 2018, Miculicich et al., 2018]
• Prior work on using the global document context for MT: [Maruf and Haffari, 2018]

SLIDE 5

The Whys?

Why selective attention for document MT?

Soft attention over all the words in the document context:
• forms a long tail that absorbs a significant share of the probability mass
• is incapable of ignoring irrelevant words
• does not scale to long documents

SLIDE 6

The Whys?

This Work

We propose a sparse and hierarchical attention approach for document NMT which:
• identifies the key sentences in the global document context, and
• attends to the key words within those sentences

SLIDE 7

Overview (current section: Proposed Approach)

SLIDE 8

Proposed Approach

Hierarchical Selective Context Attention

For each query word:
• αs: attention weights given to the sentences in the context
• αw: attention weights given to the words in the context
• αhier: re-scaled attention weights of the words in the context
• Vw: value representations of the words in the context

SLIDE 9

Proposed Approach

Hierarchical Selective Attention over Source Document

1. Sparse sentence-level key matching: identify the relevant sentences
   • Qs: representations of the words in the current sentence
   • Ks: representations of the sentences in the context
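The slide does not show the formula, but in the paper the sparsity at this step comes from replacing softmax with sparsemax [Martins and Astudillo, 2016], which projects the scores onto the probability simplex and assigns exactly zero weight to low-scoring sentences. A minimal single-head NumPy sketch of the sentence-level matching (all names and sizes are ours, for illustration; the actual model is multi-head):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of Martins & Astudillo (2016): Euclidean projection of the
    score vector z onto the probability simplex. Unlike softmax, it returns
    exactly zero for low-scoring entries."""
    z_sorted = np.sort(z)[::-1]                  # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # entries kept in the support
    k_z = k[support][-1]                         # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z            # threshold
    return np.maximum(z - tau, 0.0)

d = 64                                           # model dimension (illustrative)
rng = np.random.default_rng(0)
q_s = rng.normal(size=d)                         # one query word of the current sentence (a row of Qs)
K_s = rng.normal(size=(20, d))                   # keys of 20 context sentences (Ks)

scores = K_s @ q_s / np.sqrt(d)                  # scaled dot-product scores
alpha_s = sparsemax(scores)                      # sparse sentence-level weights
print((alpha_s > 0).sum(), "of", len(alpha_s), "sentences receive non-zero weight")
```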

SLIDE 10

Proposed Approach

Hierarchical Selective Attention over Source Document

2. Sparse word-level key matching: identify the relevant words in the relevant sentences
   • Qw: representations of the words in the current sentence
   • Kw: representations of the words in the context

SLIDE 11

Proposed Approach

Hierarchical Selective Attention over Source Document

3. Re-scale the word-level attention weights with the sentence-level ones (giving αhier)

SLIDE 12

Proposed Approach

Hierarchical Selective Attention over Source Document

4. Read the word-level values with the re-scaled attention weights

Our sparse hierarchical attention module selectively focuses on the relevant sentences in the document context and then attends to the key words within those sentences.
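Putting steps 2-4 together, here is a sketch of the word-level attention, re-scaling, and read-out for one query word, using soft attention at the word level (the H-Attention(sp-soft) variant); the sparse sentence weights αs are taken as given, e.g., from the sparsemax step above. Names and sizes are ours:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_sent, sent_len = 64, 5, 10                  # illustrative sizes
rng = np.random.default_rng(1)

q_w = rng.normal(size=d)                         # one query word (a row of Qw)
K_w = rng.normal(size=(n_sent, sent_len, d))     # word keys, grouped by context sentence (Kw)
V_w = rng.normal(size=(n_sent, sent_len, d))     # word values (Vw)
alpha_s = np.array([0.0, 0.55, 0.0, 0.45, 0.0])  # sparse sentence weights from step 1

context = np.zeros(d)
for j in range(n_sent):
    if alpha_s[j] == 0.0:                        # sentences dropped by sparsemax cost nothing
        continue
    alpha_w = softmax(K_w[j] @ q_w / np.sqrt(d)) # step 2: word-level weights within sentence j
    alpha_hier = alpha_s[j] * alpha_w            # step 3: re-scale by the sentence weight
    context += alpha_hier @ V_w[j]               # step 4: read the word-level values

# `context` is this query word's document-context vector; the alpha_hier
# entries sum to 1 overall because alpha_s and each alpha_w sum to 1.
```

Because sentences dropped by sparsemax get exactly zero weight, the word-level pass only touches the selected sentences, which is what lets the module scale to long documents.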

SLIDE 13

Proposed Approach

Flat Attention over Source Document

Soft sentence-level attention over all the sentences in the document context
• K, V: representations of the sentences in the context

Comparison to [Maruf and Haffari, 2018]:
• multi-head attention
• dynamic

SLIDE 14

Proposed Approach

Flat Attention over Source Document

Soft word-level attention over all the words in the document context
• K, V: representations of the words in the context
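For contrast, the flat variants are ordinary single-level soft attention over the whole context; a single-head sketch of the word-level case (names and sizes ours; the actual model is multi-head). It makes the scalability issue from Slide 5 concrete: the score matrix grows with the total number of words in the document, and every word keeps a non-zero weight:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, n_doc_words = 64, 2000                        # a 2000-word document context
rng = np.random.default_rng(2)

Q = rng.normal(size=(30, d))                     # words of the current sentence
K = rng.normal(size=(n_doc_words, d))            # every word in the document context
V = rng.normal(size=(n_doc_words, d))

alpha = softmax(Q @ K.T / np.sqrt(d))            # (30, 2000): one row per query word
context = alpha @ V                              # soft word-level context vectors

# Every context word gets a non-zero weight, so the long tail of irrelevant
# words still absorbs probability mass: the motivation for sparsemax.
```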

SLIDE 15

Proposed Approach

Document-level Context Layer

• Hierarchical selective or flat attention
• Monolingual context (source) integrated in the encoder
• Bilingual context (source & target) integrated in the decoder
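The slides do not show how the resulting context vector is merged back into the Transformer states; a common choice in this line of work (e.g., Zhang et al., 2018) is a per-dimension learned gate that interpolates between the sentence-level state and the document context. A sketch under that assumption (all names ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 64
rng = np.random.default_rng(3)

h = rng.normal(size=d)               # sentence-level state of one word
c = rng.normal(size=d)               # its document-context vector from the context layer
W_h = 0.1 * rng.normal(size=(d, d))  # learned matrices (random here, for illustration)
W_c = 0.1 * rng.normal(size=(d, d))

g = sigmoid(W_h @ h + W_c @ c)       # per-dimension gate: how much context to admit
h_out = g * h + (1.0 - g) * c        # gated state passed on to the next sub-layer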

SLIDE 16

Proposed Approach

Our Models and Settings

Our Models:

Hierarchical Attention over the context
• sparse at the sentence level, soft at the word level
• sparse at both the sentence and the word level

Flat Attention over the context
• soft at the sentence level
• soft at the word level

Our Settings:
• Offline document MT
• Online document MT
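These four variants are the model names used in the results slides; a hypothetical configuration map (labels from the results slides, structure ours) showing that they differ only in which normalizer each level uses, as in the sketches above:

```python
# Hypothetical mapping from the model names in the results slides to the
# normalizer used at each level of the context attention.
VARIANTS = {
    "H-Attention(sp-soft)": {"sentence": "sparsemax", "word": "softmax"},
    "H-Attention(sp-sp)":   {"sentence": "sparsemax", "word": "sparsemax"},
    "Attention(sent)":      {"sentence": "softmax",   "word": None},       # flat, sentence-level
    "Attention(word)":      {"sentence": None,        "word": "softmax"},  # flat, word-level
}
```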

SLIDE 17

Overview (current section: Experiments and Analyses)

SLIDE 18

Experiments and Analyses

Experimental Setup

Training/dev/test corpora statistics for En-De:

Domain     #Sentences (train/dev/test)   Document length (train/dev/test)
TED        0.21M / 9K / 2.3K             120.89 / 96.42 / 98.74
News       0.24M / 2K / 3K               38.93 / 26.78 / 19.35
Europarl   1.67M / 3.6K / 5.1K           14.14 / 14.95 / 14.06

Baselines:
• Context-agnostic baselines: RNNSearch, Transformer
• Local source-context baselines for online document MT: [Zhang et al., 2018] and [Miculicich et al., 2018]

Evaluation metrics: BLEU, METEOR

SLIDE 19

Experiments and Analyses

Bilingual Context Integration in Decoder (Online Setting)

BLEU scores:

Model                       TED     News    Europarl
Transformer                 23.28   22.78   28.72
[Miculicich et al., 2018]   24.39   24.38   29.58
Attention(sent)             24.29   24.75   29.56
Attention(word)             24.02   24.17   29.9
H-Attention(sp-soft)        24.62   24.36   29.8
H-Attention(sp-sp)          24.43   24.58   29.64

SLIDE 20

Experiments and Analyses

Analyses

• Automatic evaluation metrics for translation do not assess how well models translate inter-sentential phenomena
• We measure the accuracy of translating the English pronoun "it" to its German counterparts "es", "er" and "sie" using a contrastive test set [Müller et al., 2018]
• We perform a subjective evaluation in terms of adequacy and fluency [Läubli et al., 2018]
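The contrastive test set of [Müller et al., 2018] turns pronoun translation into a scoring task: each source sentence comes with a correct translation and contrastive variants that differ only in the pronoun (es/er/sie), and a model counts as correct when it scores the correct translation highest. A sketch of the evaluation loop; the score function (the model's log-probability of a target given the source and document context) is a stand-in, not an actual API:

```python
# Minimal sketch of contrastive pronoun evaluation; `score(src, tgt, ctx)`
# stands in for the model's log-probability of tgt given src and the
# document context ctx. It is not part of any released code.
def contrastive_accuracy(examples, score):
    correct = 0
    for ex in examples:
        # ex["correct"] is the reference; ex["contrastive"] holds the same
        # sentence with the pronoun swapped (es / er / sie).
        ref_score = score(ex["src"], ex["correct"], ex["context"])
        if all(ref_score > score(ex["src"], c, ex["context"])
               for c in ex["contrastive"]):
            correct += 1
    return correct / len(examples)
```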

SLIDE 21

Experiments and Analyses

Accuracy of pronoun translation vs. antecedent distance

[Bar chart: accuracy (0.5-0.8) by antecedent distance, rightmost bin ">3", for Transformer (0.59, 0.64), [Miculicich et al., 2018] (0.72, 0.66), Attention(sent) (0.73, 0.66), Attention(word) (0.69, 0.68), H-Attention(sp-soft) (0.69, 0.66) and H-Attention(sp-sp) (0.71, 0.69); every context-aware model outperforms the Transformer baseline.]

SLIDE 22

Experiments and Analyses

Model Complexity

Model                       #Params   Training (words/sec.)   Decoding (words/sec.)
Transformer                 50M       5100                    86.33
+Attention (sentence)       53.7M     3750                    83.84
+Attention (word)           53.7M     3100                    81.38
+H-Attention                54.2M     2600                    74.11
[Miculicich et al., 2018]   54.8M     1650                    76.90

SLIDE 23

Experiments and Analyses

Qualitative Analysis

Src: Croatia is their homeland , too .
Tgt: Kroatien ist auch ihre Heimat .
Transformer: Kroatien ist auch seine Heimat .
Our Model: Kroatien ist auch ihr Heimatland .

Head 8: top sentences, with attention to words related to the antecedent
sj−1: to name but a few , these include cooperation with the Hague Tribunal , efforts made so far in prosecuting corruption , restructuring the economy and finances and greater commitment and sincerity in eliminating the obstacles to the return of Croatia 's Serbian population .
sj−4: by signing a border arbitration agreement with its neighbour Slovenia , the new Croatian Government has not only eliminated an obstacle to the negotiating process , but has also paved the way for the resolution of other issues .

SLIDE 24

Overview (current section: Summary)

SLIDE 25

Summary

• We proposed a novel and scalable top-down approach to hierarchical attention for document NMT
• Experiments in two document MT settings show that our approach surpasses context-agnostic and context-aware baselines in the majority of cases
• Future work: investigate the benefits of sparse attention for better interpretability of context-aware NMT models

SLIDE 26

References I

Jean, S., Lauly, S., Firat, O., and Cho, K. (2017). Does Neural Machine Translation Benefit from Larger Context? arXiv:1704.05135.

Wang, L., Tu, Z., Way, A., and Liu, Q. (2017). Exploiting Cross-Sentence Context for Neural Machine Translation. In Proceedings of EMNLP 2017.

Bawden, R., Sennrich, R., Birch, A., and Haddow, B. (2018). Evaluating Discourse Phenomena in Neural Machine Translation. In Proceedings of NAACL-HLT 2018.

Voita, E., Serdyukov, P., Sennrich, R., and Titov, I. (2018). Context-aware Neural Machine Translation Learns Anaphora Resolution. In Proceedings of ACL 2018.

Tu, Z., Liu, Y., Shi, S., and Zhang, T. (2018). Learning to Remember Translation History with a Continuous Cache. Transactions of the Association for Computational Linguistics (TACL).

Zhang, J., Luan, H., Sun, M., Zhai, F., Xu, J., Zhang, M., and Liu, Y. (2018). Improving the Transformer Translation Model with Document-level Context. In Proceedings of EMNLP 2018.

Miculicich, L., Ram, D., Pappas, N., and Henderson, J. (2018). Document-level Neural Machine Translation with Hierarchical Attention Networks. In Proceedings of EMNLP 2018.

SLIDE 27

References II

Maruf, S. and Haffari, G. (2018). Document Context Neural Machine Translation with Memory Networks. In Proceedings of ACL 2018.

Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., Duh, K., Faruqui, M., Gan, C., Garrette, D., Ji, Y., Kong, L., Kuncoro, A., Kumar, G., Malaviya, C., Michel, P., Oda, Y., Richardson, M., Saphra, N., Swayamdipta, S., and Yin, P. (2017). DyNet: The Dynamic Neural Network Toolkit. arXiv:1701.03980.

Müller, M., Rios, A., Voita, E., and Sennrich, R. (2018). A Large-Scale Test Set for the Evaluation of Context-aware Pronoun Translation in Neural Machine Translation. In Proceedings of WMT 2018.

Läubli, S., Sennrich, R., and Volk, M. (2018). Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation. In Proceedings of EMNLP 2018.

SLIDE 28

Implementation and Hyperparameters

Implementation: DyNet C++ interface [Neubig et al., 2017], using Transformer-DyNet (https://github.com/duyvuleo/Transformer-DyNet)

Parameter                        Details
#Layers                          4
#Heads                           8
Hidden dimensions                512
Feed-forward layer size          2048
Optimizer                        Adam (lr=0.0001)
Dropout (base model)             0.1
Dropout (document-level model)   0.2
Label smoothing                  0.1

Src/Tgt vocabulary sizes: TED 17.1k/23.2k, News 16.9k/23.3k, Europarl 16.6k/25.4k (joint BPE vocabulary size 30k)

SLIDE 29

Monolingual Context Integration in Encoder

SLIDE 30

Bilingual Context Integration in Decoder

SLIDE 31

Qualitative Analysis

Src: my thoughts are also with the victims .
Ref: meine Gedanken sind auch bei den Opfern .
Transformer: ich denke auch an die Opfer .
Our Model: meine Gedanken sind auch bei den Opfern .

Head 2: top sentences with attention to related words
sj−2: ( FR ) Madam President , many things have already been said , but I would like to echo all the words of sympathy and support that have already been addressed to the peoples of Tunisia and Egypt .
sj+4: it must implement a strong strategy towards these countries .
sj−1: they are a symbol of hope for all those who defend freedom .