SLIDE 1

Parameter Sharing Methods for Multilingual Self-Attentional Translation Models

Devendra Sachan¹   Graham Neubig²

¹ Data Solutions Team, Petuum Inc, USA
² Language Technologies Institute, Carnegie Mellon University, USA

Conference on Machine Translation (WMT), November 2018

SLIDES 2-5

Multilingual Machine Translation

[Figure: a multilingual machine translation system mapping source sentences in English, German, Dutch, and Japanese to target sentences in English, German, Dutch, and Japanese]

◮ Goal: Train a machine learning system to translate from multiple source languages to multiple target languages.
◮ Multilingual models follow the multi-task learning (MTL) paradigm:
  • 1. Models are jointly trained on data from several language pairs.
  • 2. They incorporate some degree of parameter sharing.
SLIDES 6-7

One-to-Many Multilingual Translation

[Figure: a multilingual machine translation system translating an English source sentence into German and Dutch]

◮ Translation from a common source language ("En") to multiple target languages ("De" and "Nl").
◮ This is a difficult task, as the model must translate into (i.e. generate) multiple target languages.

SLIDES 8-11

Previous Approach: Separate Decoders

[Figure: a shared encoder for the source language "En" feeding Decoder 1 for target language "De" and Decoder 2 for target language "Nl"]

◮ One shared encoder and one decoder per target language.¹
◮ Advantage: ability to model each target language separately.
◮ Disadvantages:
  • 1. Slower training
  • 2. Increased memory requirements

¹ Multi-Task Learning for Multiple Language Translation, ACL 2015

SLIDES 12-15

Previous Approach: Shared Decoder

[Figure: a shared encoder for the source language "En" feeding a single shared decoder that produces both target languages "De" and "Nl"]

◮ Single unified model: a shared encoder and a shared decoder for all language pairs.²
◮ Advantages:
  • 1. Trivially implementable using a standard bilingual translation model.
  • 2. Constant number of trainable parameters.
◮ Disadvantage: the decoder's ability to model multiple languages can be significantly reduced.

² Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation, TACL 2017

SLIDES 16-18

Our Proposed Approach: Partial Sharing

[Figure: a shared encoder for the source language "En" feeding Decoder 1 ("De") and Decoder 2 ("Nl"), with a block of shareable parameters between the two decoders]

◮ Share some, but not all, parameters.
◮ Generalizes the previous approaches.
◮ We focus on the self-attentional Transformer model.
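To make the idea concrete, the following is a minimal sketch of how partial sharing could be implemented in PyTorch. It is not the authors' released implementation (that code is linked at the end of the talk); build_shared_decoder and shared_name_fragments are hypothetical names. The sketch simply re-points selected parameters of a cloned decoder at the original decoder's tensors, so the two decoders update the same weights.

```python
# Illustrative sketch only: tie a chosen subset of parameters between two
# per-language Transformer decoders.
import copy
import torch.nn as nn

def build_shared_decoder(decoder_1: nn.Module, shared_name_fragments):
    """Clone decoder_1 for a second target language, then tie every parameter
    whose name contains one of the given fragments back to decoder_1."""
    decoder_2 = copy.deepcopy(decoder_1)
    originals = dict(decoder_1.named_parameters())
    for name in [n for n, _ in decoder_2.named_parameters()]:
        if any(fragment in name for fragment in shared_name_fragments):
            # Walk to the owning submodule and re-register the original tensor,
            # so both decoders share (and jointly update) this parameter.
            module = decoder_2
            *path, leaf = name.split(".")
            for attr in path:
                module = getattr(module, attr)
            setattr(module, leaf, originals[name])
    return decoder_2

# Example (hypothetical module names): share only the key and query
# projections of both attention sublayers.
# decoder_nl = build_shared_decoder(decoder_de, {"k_proj", "q_proj"})
```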

SLIDES 19-28

Transformer Model³

◮ Embedding Layer
◮ Encoder Layer (2 sublayers)
  • 1. Self-attention
  • 2. Feed-forward network
◮ Decoder Layer (3 sublayers)
  • 1. Masked self-attention
  • 2. Encoder-decoder attention
  • 3. Feed-forward network
◮ Output generation layer

[Figure: one Transformer decoder layer — an embedding layer W_E with position encoding; a masked self-attention sublayer (W^1_K, W^1_V, W^1_Q, W^1_F); an encoder-decoder attention sublayer (W^2_K, W^2_V, W^2_Q, W^2_F) attending over the encoder hidden states z_i; a feed-forward network sublayer (W_L1, W_L2) with ReLU; layer normalization around each sublayer; and a tied linear output layer W_E⊺]

³ Attention Is All You Need, NIPS 2017
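As a reference for the sublayers listed above, here is a compact PyTorch sketch of one decoder layer. The dimensions (d_model, n_heads, d_ff) are illustrative defaults rather than the configuration used in the paper, and it uses torch.nn.MultiheadAttention instead of the per-matrix W^1/W^2 parameterization shown in the figure.

```python
# Minimal sketch of one Transformer decoder layer with its three sublayers.
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)     # masked self-attention
        self.enc_dec_attn = nn.MultiheadAttention(d_model, n_heads)  # encoder-decoder attention
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))           # feed-forward network
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out, causal_mask):
        # 1. Masked self-attention over previously generated target tokens.
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # 2. Attention over the encoder hidden states.
        a, _ = self.enc_dec_attn(x, enc_out, enc_out)
        x = self.norm2(x + a)
        # 3. Position-wise feed-forward network.
        return self.norm3(x + self.ffn(x))
```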

SLIDES 29-32

Transformer Decoder's Parameters

◮ Embedding Layer: W_E ∈ R^(d_m × V)
◮ Masked Self-Attention: W^1_K, W^1_V, W^1_Q, W^1_F ∈ R^(d_m × d_m)
◮ Encoder-Decoder Attention: W^2_K, W^2_V, W^2_Q, W^2_F ∈ R^(d_m × d_m)
◮ Feed-Forward Network: W_L1 ∈ R^(d_m × d_h), W_L2 ∈ R^(d_h × d_m)

[Figure: the same decoder-layer diagram, with each sublayer's weight matrices labeled]
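The same parameter list can be written out as shapes. The numeric values of d_m, d_h, and V below are placeholders for illustration, not the settings reported in the paper.

```python
# Shapes of one decoder layer's shareable weight matrices (plus the embedding).
# d_m, d_h, and V are hyperparameters; the numbers here are illustrative only.
d_m, d_h, V = 512, 2048, 32000
decoder_parameter_shapes = {
    "W_E":  (d_m, V),                        # embedding layer (tied with the output layer)
    "W1_K": (d_m, d_m), "W1_V": (d_m, d_m),  # masked self-attention
    "W1_Q": (d_m, d_m), "W1_F": (d_m, d_m),
    "W2_K": (d_m, d_m), "W2_V": (d_m, d_m),  # encoder-decoder attention
    "W2_Q": (d_m, d_m), "W2_F": (d_m, d_m),
    "W_L1": (d_m, d_h), "W_L2": (d_h, d_m),  # feed-forward network
}
```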

SLIDES 33-46

Parameter Sharing Strategies

[Figure: a shared encoder (embedding layer, self-attention, feed-forward network) for the source language "En", and two decoders for target languages "De" and "Nl"; each decoder's embedding layer, masked self-attention, enc-dec attention, feed-forward network, and tied linear layer are marked as shareable]

◮ Shareable parameters: embedding, attention, feed-forward, and tied linear layer weights.
◮ Θ = the set of shared parameters.

◮ No parameter sharing: separate bilingual translation models, Θ = ∅
◮ Embedding sharing: a common embedding layer, Θ = {W_E}
◮ +Encoder sharing: a common encoder and a separate decoder per target language, Θ = {W_E, θ_ENC}
◮ +Decoder sharing: next, include decoder parameters among the set of shared parameters.
  • Exponentially many combinations are possible, so we select only a subset.
  • The selected weights are shared in all decoder layers.
  • Feed-forward sublayer: Θ = {W_E, θ_ENC, W_L1, W_L2}
  • Self-attention sublayer: Θ = {W_E, θ_ENC, W^1_K, W^1_Q, W^1_V, W^1_F}
  • Encoder-decoder attention sublayer: Θ = {W_E, θ_ENC, W^2_K, W^2_Q, W^2_V, W^2_F}
  • Key and query weights only: Θ = {W_E, θ_ENC, W^1_K, W^1_Q, W^2_K, W^2_Q}
  • Key and value weights only: Θ = {W_E, θ_ENC, W^1_K, W^1_V, W^2_K, W^2_V}
◮ Full sharing: all decoder parameters are shared, giving a single unified model, Θ = {W_E, θ_ENC, θ_DEC}
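For illustration, these strategies can be written as the sets of decoder parameter-name fragments to tie, usable with the build_shared_decoder sketch earlier in these notes. The names assume attention sublayers with separate k_proj/q_proj/v_proj projections and a module called ffn; this naming is an assumption of the sketch, not the authors' code.

```python
# Hypothetical mapping from the sharing strategies above to decoder
# parameter-name fragments (for use with the build_shared_decoder sketch).
# The shared embedding W_E and shared encoder θ_ENC are handled separately.
SHARING_STRATEGIES = {
    "no_sharing":        set(),             # Θ = ∅
    "ffn":               {"ffn"},           # Θ ⊇ {W_L1, W_L2}
    "self_attention":    {"self_attn"},     # Θ ⊇ {W^1_K, W^1_Q, W^1_V, W^1_F}
    "enc_dec_attention": {"enc_dec_attn"},  # Θ ⊇ {W^2_K, W^2_Q, W^2_V, W^2_F}
    "key_query": {"self_attn.k_proj", "self_attn.q_proj",
                  "enc_dec_attn.k_proj", "enc_dec_attn.q_proj"},  # Θ ⊇ {W^1_K, W^1_Q, W^2_K, W^2_Q}
    "key_value": {"self_attn.k_proj", "self_attn.v_proj",
                  "enc_dec_attn.k_proj", "enc_dec_attn.v_proj"},  # Θ ⊇ {W^1_K, W^1_V, W^2_K, W^2_V}
    "full_sharing": {""},  # the empty string matches every name: Θ = all decoder parameters
}
```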
SLIDES 47-51

Dataset

◮ Six language pairs from the TED talks dataset⁴ (https://github.com/neulab/word-embeddings-for-nmt).
◮ The languages belong to different linguistic families:
  • Romanian (Ro) and French (Fr) are Romance languages.
  • German (De) and Dutch (Nl) are Germanic languages.
  • Turkish (Tr, Turkic family) and Japanese (Ja, Japonic family) are unrelated languages.

⁴ When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?, NAACL 2018

SLIDES 52-55

Multilingual Model Training Details

◮ An extra target-language token is prepended to each source sentence.
◮ Training uses mini-batches balanced across the target languages.
◮ We minimize a weighted average cross-entropy loss, where each language pair's weight is proportional to its target-side word count.
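A small sketch of the first and last points, with illustrative function names rather than the authors' code: the target-language token is prepended on the source side, and each language pair's cross-entropy loss is weighted by its share of target-side words.

```python
# Illustrative sketch of two of the training details above.
def prepend_language_token(src_tokens, tgt_lang):
    """E.g. (["hello", "world"], "de") -> ["<2de>", "hello", "world"]."""
    return [f"<2{tgt_lang}>"] + src_tokens

def multilingual_loss(per_pair_loss, per_pair_word_count):
    """Weighted average of per-language-pair cross-entropy losses,
    weighted by each pair's target-side word count."""
    total_words = sum(per_pair_word_count.values())
    return sum(per_pair_loss[pair] * per_pair_word_count[pair] / total_words
               for pair in per_pair_loss)
```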

SLIDES 56-60

Results: Baselines

◮ GNMT model: based on recurrent LSTMs with residual connections and attention
  • 1. GNMT NS: No Sharing
  • 2. GNMT FS: Full Sharing
◮ Transformer NS: separate models for each language pair
◮ Transformer FS: one model for all language pairs

SLIDES 61-62

Results: Target languages are from the same family

[Figure: BLEU bar charts for En→Ro and En→Fr (the En→Ro+Fr model) and for En→De and En→Nl (the En→De+Nl model), comparing GNMT NS, GNMT FS, TF NS, and TF FS]

BLEU scores:
◮ GNMT NS ≪ GNMT FS < TF NS ≪ TF FS

SLIDES 63-64

Results: Target languages are from different families

[Figure: BLEU bar charts for En→De and En→Tr (the En→De+Tr model) and for En→De and En→Ja (the En→De+Ja model), comparing GNMT NS, GNMT FS, TF NS, and TF FS]

BLEU scores:
◮ GNMT NS ≪ GNMT FS ≲ TF NS
◮ TF NS ≥ TF FS for En→De+Tr
◮ TF NS ≈ TF FS for En→De+Ja

SLIDES 65-66

Results: Transformer Partial Sharing, Θ = {W_E}

[Figure: BLEU bar charts for En→Ro+Fr and En→De+Nl (same family) and for En→De+Tr and En→De+Ja (different families), comparing GNMT NS, GNMT FS, TF NS, TF FS, and TF PS]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr; TF FS ≈ TF PS for En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 67-68

Results: Transformer Partial Sharing, Θ = {W_E} + {θ_ENC}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 69-70

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W_L1, W_L2}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr and En→De+Ja

SLIDES 71-72

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^1_K, W^1_Q, W^1_V, W^1_F}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr; TF FS ≈ TF PS for En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 73-74

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^2_K, W^2_Q, W^2_V, W^2_F}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS ≈ TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 75-76

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^1_K, W^1_V, W^2_K, W^2_V}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr; TF FS ≈ TF PS for En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr and En→De+Ja

SLIDES 77-78

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^1_K, W^1_Q, W^2_K, W^2_Q}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS ≈ TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS ≪ TF PS for En→De+Tr and En→De+Ja

SLIDES 79-80

Results: Target languages are from the same family

◮ Sharing all parameters leads to the best BLEU scores for En→Ro+Fr.
◮ Sharing only the key and query weights from both decoder attention sublayers leads to the best BLEU scores for En→De+Nl.

SLIDES 81-82

Results: Target languages are from distant families

◮ Sharing all parameters leads to a noticeable drop in BLEU scores for both language pairs considered.
◮ Sharing the key and query parameters results in a large increase in BLEU scores.

SLIDES 83-89

Conclusions

◮ We explore parameter sharing strategies for multilingual translation with self-attentional (Transformer) models.
◮ We examine cases where the target languages come from the same or from distant language families.
◮ The popular approach of full parameter sharing may perform well only when the target languages belong to the same family.
◮ Partial sharing of the embedding, encoder, and decoder key/query weights is applicable to all kinds of language pairs.
◮ Partial parameter sharing achieves the best BLEU scores when the target languages are from distant families.

Code: https://github.com/DevSinghSachan/multilingual_nmt

Thank you! Questions?