  1. Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English
  Maha Elbayad (1,2), Michael Ustaszewski (3), Emmanuelle Esperança-Rodier (1), Francis Brunet-Manquat (1), Jakob Verbeek (4), Laurent Besacier (1)

  2. Outline
  1. Introduction to online translation
  2. Neural architectures for online NMT
     a. Transformer (Vaswani et al. 2017)
     b. Pervasive Attention (Elbayad et al. 2018)
  3. Automatic evaluation
  4. Human evaluation
  5. Conclusion
  Elbayad et al., Online vs. Offline NMT Quality

  3. Online Neural Machine Translation
  [Figure: two decoding grids over source tokens x1..x7 and target tokens y1..y8. In offline translation the full source is read before any target token is produced; in online translation target tokens are emitted while the source is still being read.]

  4. Wait-k Decoders for Online Translation
  The wait-k schedule: z_t^{wait-k} = min(k + t − 1, |x|) for all t ∈ [1..|y|], i.e. the decoder first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted.
  [Figure: decoding grids for wait-1, wait-3 and wait-∞ over source x1..x5 and target y1..y5; wait-∞ reads the whole source first, i.e. offline decoding.]
  Wait-k, or prefix-to-prefix, decoding (Dalvi et al. 2018; Ma et al. 2019; Elbayad et al. 2020).
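The wait-k schedule above is easy to compute directly; the helper below is a minimal sketch (the function name is ours, not the authors' code) of z_t = min(k + t − 1, |x|):

```python
# Minimal sketch of the wait-k read/write schedule (hypothetical helper,
# not the authors' code): at target step t the decoder may look at the
# first z_t = min(k + t - 1, |x|) source tokens.

def wait_k_schedule(k, src_len, tgt_len):
    """Number of visible source tokens at each target step t = 1..tgt_len."""
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]

# Wait-1 writes after every read; a large enough k reduces to offline decoding.
print(wait_k_schedule(1, 5, 5))  # [1, 2, 3, 4, 5]
print(wait_k_schedule(3, 5, 5))  # [3, 4, 5, 5, 5]
```

Once z_t reaches |x| the schedule saturates and decoding proceeds as in the offline case.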


  6. Online Transformer
  ◮ Unidirectional encoder (Elbayad et al. 2020): each encoder state s_j attends only to the source tokens x_1..x_j, so existing states need not be recomputed when the visible source grows (e.g. from z_t = 4 to z_{t+1} = 5).
  ◮ Masked decoder: the cross-attention energies are masked w.r.t. z_t, so the decoder state h_{t−1} attends only to the first z_t encoder states s_1..s_{z_t}.
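The masked decoder can be sketched in a few lines; this is an illustrative NumPy version (function name and shapes are our assumptions, not the paper's implementation) of masking cross-attention energies with respect to z_t:

```python
import numpy as np

def masked_cross_attention(energies, z):
    """Mask cross-attention energies for online decoding.

    energies: (tgt_len, src_len) raw attention scores.
    z:        z[t] = number of source tokens visible at decoder step t (0-indexed).
    Returns softmax weights in which future source positions get zero weight.
    """
    masked = energies.astype(float).copy()
    for t in range(masked.shape[0]):
        masked[t, z[t]:] = -np.inf            # hide source tokens beyond z_t
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

# With uniform energies and z = [2, 3], the first step spreads its weight
# over x1 and x2 only.
w = masked_cross_attention(np.zeros((2, 4)), [2, 3])
print(w[0])  # weight only on the first two source positions
```

Setting the masked energies to −inf before the softmax makes hidden positions receive exactly zero attention weight.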

  7. The Pervasive Attention Architecture (Elbayad et al. 2018)
  [Figure: concatenated source-target embeddings form a 2D grid H_0 over the source (x) and target (y); convolutional layers produce feature maps H_1..H_N; after aggregation over the source axis (H_conv, H_out), the model outputs p(y_1 | y_{<1}, x) .. p(y_{|y|} | y_{<|y|}, x).]


  9. Online Pervasive Attention
  [Figure: 2D source-target grid with a causal convolution window W producing y_t from the source prefix x_1..x_{z_t} and the target prefix up to y_{t−1}.]
  ◮ Masking the future source (2D causal convolution) for unidirectional encoding.
  ◮ The appropriate context size z_t is controlled during the feature aggregation.
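One way to picture the 2D causal convolution is as masking of filter taps. The sketch below is an illustration under assumed conventions (axis 0 = target, axis 1 = source; not the authors' code):

```python
import numpy as np

def causal_mask_2d(filt):
    """Zero the taps of a (k, k) filter that would peek at future positions.

    Axis 0 = target, axis 1 = source. Masking rows below the center keeps
    the model causal in the target; masking columns right of the center
    (the online variant) additionally hides future source tokens.
    """
    c = filt.shape[0] // 2
    masked = filt.copy()
    masked[c + 1:, :] = 0.0   # future target rows
    masked[:, c + 1:] = 0.0   # future source columns
    return masked

# A 7x7 filter keeps a 4x4 block of taps, matching the "effectively 4x4"
# filters of the Pervasive Attention models in the setup slide.
m = causal_mask_2d(np.ones((7, 7)))
print(int(m.sum()))  # 16
```

Zeroing taps rather than shrinking the filter keeps the layer shapes unchanged while enforcing causality.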

  10. Training and Evaluation Setup
  Data
  ◮ IWSLT'14 De-En and En-De (Cettolo et al. 2014).
  ◮ Sentences longer than 175 words and pairs with a length ratio above 1.5 are removed.
  ◮ The data is tokenized but not lowercased.
  ◮ The sequences are BPE-segmented (Sennrich et al. 2016), giving a 32K vocabulary.
  ◮ Training = 160K, development = 7.3K, test = 6.7K sentence pairs.
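The corpus filtering described above might look like this in code (a hypothetical helper, not the authors' preprocessing script):

```python
def keep_pair(src, tgt, max_len=175, max_ratio=1.5):
    """Keep a sentence pair unless either side exceeds max_len words
    or the word-length ratio exceeds max_ratio."""
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0 or ls > max_len or lt > max_len:
        return False
    return max(ls, lt) / min(ls, lt) <= max_ratio

print(keep_pair("das ist ein Test", "this is a test"))   # True
print(keep_pair("kurz", "a much longer translation"))    # False (ratio 4.0)
```

Note that the filter counts words before BPE segmentation, consistent with the "175 words" threshold on the slide.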


  16. Training and Evaluation Setup
  Models
  ◮ For each direction and each architecture, one online and one offline model.
  ◮ Pervasive Attention (PA) with 14 layers and 7 × 7 filters (effectively 4 × 4 after causal masking).
  ◮ Transformer (TF) small.
  ◮ Online models trained with k_train = 7 and evaluated with k_eval = 3.
  ◮ Greedy decoding for all models.

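Greedy wait-k decoding as used in this setup can be simulated end to end. In the sketch below, all names are hypothetical and `copy_model` is a toy stand-in for a trained NMT model:

```python
def greedy_wait_k_decode(src, next_token, k=3, max_len=50, eos="</s>"):
    """Greedy wait-k decoding: at step t, condition on the source prefix
    of length z_t = min(k + t - 1, len(src)) and the target so far."""
    tgt = []
    while len(tgt) < max_len:
        z = min(k + len(tgt), len(src))   # z_t with t = len(tgt) + 1
        y = next_token(src[:z], tgt)      # greedy: pick the single best token
        if y == eos:
            break
        tgt.append(y)
    return tgt

def copy_model(src_prefix, tgt):
    """Toy stand-in model that copies the next visible source token."""
    return src_prefix[len(tgt)] if len(tgt) < len(src_prefix) else "</s>"

print(greedy_wait_k_decode(["hallo", "welt", "!"], copy_model, k=3))
# ['hallo', 'welt', '!']
```

With k_eval = 3 the decoder lags three tokens behind the source; swapping `next_token` for a real model's argmax step yields the greedy decoding used in the evaluation.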
