SLIDE 1

Parameter Sharing Methods for Multilingual Self-Attentional Translation Models

Devendra Sachan¹   Graham Neubig²

¹ Data Solutions Team, Petuum Inc, USA
² Language Technologies Institute, Carnegie Mellon University, USA

Conference on Machine Translation (WMT), November 2018

SLIDES 2-5

Multilingual Machine Translation

[Figure: a multilingual machine translation system mapping source sentences in English, German, Dutch, and Japanese to target sentences in English, German, Dutch, and Japanese]

◮ Goal: Train a machine learning system to translate from multiple source languages to multiple target languages.
◮ Multilingual models follow the multi-task learning (MTL) paradigm:
  • 1. Models are jointly trained on data from several language pairs.
  • 2. They incorporate some degree of parameter sharing.
SLIDES 6-7

One-to-Many Multilingual Translation

[Figure: a multilingual machine translation system translating an English source sentence into German and Dutch]

◮ Translation from a common source language ("En") to multiple target languages ("De" and "Nl").
◮ This is a difficult task, as the model must translate into (i.e. generate) multiple target languages.

SLIDES 8-11

Previous Approach: Separate Decoders

[Figure: a shared encoder for the source language "En" feeding Decoder 1 for target language "De" and Decoder 2 for target language "Nl"]

◮ One shared encoder and one decoder per target language.¹
◮ Advantage: ability to model each target language separately.
◮ Disadvantages:
  • 1. Slower training
  • 2. Increased memory requirements

¹ Multi-Task Learning for Multiple Language Translation, ACL 2015

SLIDES 12-15

Previous Approach: Shared Decoder

[Figure: a shared encoder for the source language "En" feeding a single shared decoder that produces both target languages "De" and "Nl"]

◮ Single unified model: a shared encoder and a shared decoder for all language pairs.²
◮ Advantages:
  • 1. Trivially implementable using a standard bilingual translation model.
  • 2. Constant number of trainable parameters.
◮ Disadvantage: the decoder's ability to model multiple languages can be significantly reduced.

² Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation, TACL 2017

SLIDES 16-18

Our Proposed Approach: Partial Sharing

[Figure: a shared encoder for the source language "En" feeding Decoder 1 ("De") and Decoder 2 ("Nl"), with a block of shareable parameters between the two decoders]

◮ Share some, but not all, parameters.
◮ Generalizes the previous approaches.
◮ We focus on the self-attentional Transformer model.
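To make the idea concrete, the following is a minimal sketch of how partial sharing could be implemented in PyTorch. It is not the authors' released implementation (that code is linked at the end of the talk); build_shared_decoder and shared_name_fragments are hypothetical names. The sketch simply re-points selected parameters of a cloned decoder at the original decoder's tensors, so the two decoders update the same weights.

```python
# Illustrative sketch only: tie a chosen subset of parameters between two
# per-language Transformer decoders.
import copy
import torch.nn as nn

def build_shared_decoder(decoder_1: nn.Module, shared_name_fragments):
    """Clone decoder_1 for a second target language, then tie every parameter
    whose name contains one of the given fragments back to decoder_1."""
    decoder_2 = copy.deepcopy(decoder_1)
    originals = dict(decoder_1.named_parameters())
    for name in [n for n, _ in decoder_2.named_parameters()]:
        if any(fragment in name for fragment in shared_name_fragments):
            # Walk to the owning submodule and re-register the original tensor,
            # so both decoders share (and jointly update) this parameter.
            module = decoder_2
            *path, leaf = name.split(".")
            for attr in path:
                module = getattr(module, attr)
            setattr(module, leaf, originals[name])
    return decoder_2

# Example (hypothetical module names): share only the key and query
# projections of both attention sublayers.
# decoder_nl = build_shared_decoder(decoder_de, {"k_proj", "q_proj"})
```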

SLIDES 19-28

Transformer Model³

◮ Embedding Layer
◮ Encoder Layer (2 sublayers)
  • 1. Self-attention
  • 2. Feed-forward network
◮ Decoder Layer (3 sublayers)
  • 1. Masked self-attention
  • 2. Encoder-decoder attention
  • 3. Feed-forward network
◮ Output generation layer

[Figure: one Transformer decoder layer — an embedding layer W_E with position encoding; a masked self-attention sublayer (W^1_K, W^1_V, W^1_Q, W^1_F); an encoder-decoder attention sublayer (W^2_K, W^2_V, W^2_Q, W^2_F) attending over the encoder hidden states z_i; a feed-forward network sublayer (W_L1, W_L2) with ReLU; layer normalization around each sublayer; and a tied linear output layer W_E⊺]

³ Attention Is All You Need, NIPS 2017
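As a reference for the sublayers listed above, here is a compact PyTorch sketch of one decoder layer. The dimensions (d_model, n_heads, d_ff) are illustrative defaults rather than the configuration used in the paper, and it uses torch.nn.MultiheadAttention instead of the per-matrix W^1/W^2 parameterization shown in the figure.

```python
# Minimal sketch of one Transformer decoder layer with its three sublayers.
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)     # masked self-attention
        self.enc_dec_attn = nn.MultiheadAttention(d_model, n_heads)  # encoder-decoder attention
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))           # feed-forward network
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out, causal_mask):
        # 1. Masked self-attention over previously generated target tokens.
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # 2. Attention over the encoder hidden states.
        a, _ = self.enc_dec_attn(x, enc_out, enc_out)
        x = self.norm2(x + a)
        # 3. Position-wise feed-forward network.
        return self.norm3(x + self.ffn(x))
```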

SLIDES 29-32

Transformer Decoder's Parameters

◮ Embedding Layer: W_E ∈ R^(d_m × V)
◮ Masked Self-Attention: W^1_K, W^1_V, W^1_Q, W^1_F ∈ R^(d_m × d_m)
◮ Encoder-Decoder Attention: W^2_K, W^2_V, W^2_Q, W^2_F ∈ R^(d_m × d_m)
◮ Feed-Forward Network: W_L1 ∈ R^(d_m × d_h), W_L2 ∈ R^(d_h × d_m)

[Figure: the same decoder-layer diagram, with each sublayer's weight matrices labeled]
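The same parameter list can be written out as shapes. The numeric values of d_m, d_h, and V below are placeholders for illustration, not the settings reported in the paper.

```python
# Shapes of one decoder layer's shareable weight matrices (plus the embedding).
# d_m, d_h, and V are hyperparameters; the numbers here are illustrative only.
d_m, d_h, V = 512, 2048, 32000
decoder_parameter_shapes = {
    "W_E":  (d_m, V),                        # embedding layer (tied with the output layer)
    "W1_K": (d_m, d_m), "W1_V": (d_m, d_m),  # masked self-attention
    "W1_Q": (d_m, d_m), "W1_F": (d_m, d_m),
    "W2_K": (d_m, d_m), "W2_V": (d_m, d_m),  # encoder-decoder attention
    "W2_Q": (d_m, d_m), "W2_F": (d_m, d_m),
    "W_L1": (d_m, d_h), "W_L2": (d_h, d_m),  # feed-forward network
}
```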

SLIDES 33-46

Parameter Sharing Strategies

[Figure: a shared encoder (embedding layer, self-attention, feed-forward network) for the source language "En", and two decoders for target languages "De" and "Nl"; each decoder's embedding layer, masked self-attention, enc-dec attention, feed-forward network, and tied linear layer are marked as shareable]

◮ Shareable parameters: embedding, attention, feed-forward, and tied linear layer weights.
◮ Θ = the set of shared parameters.

◮ No parameter sharing: separate bilingual translation models, Θ = ∅
◮ Embedding sharing: a common embedding layer, Θ = {W_E}
◮ +Encoder sharing: a common encoder and a separate decoder per target language, Θ = {W_E, θ_ENC}
◮ +Decoder sharing: next, include decoder parameters among the set of shared parameters.
  • Exponentially many combinations are possible, so we select only a subset.
  • The selected weights are shared in all decoder layers.
  • Feed-forward sublayer: Θ = {W_E, θ_ENC, W_L1, W_L2}
  • Self-attention sublayer: Θ = {W_E, θ_ENC, W^1_K, W^1_Q, W^1_V, W^1_F}
  • Encoder-decoder attention sublayer: Θ = {W_E, θ_ENC, W^2_K, W^2_Q, W^2_V, W^2_F}
  • Key and query weights only: Θ = {W_E, θ_ENC, W^1_K, W^1_Q, W^2_K, W^2_Q}
  • Key and value weights only: Θ = {W_E, θ_ENC, W^1_K, W^1_V, W^2_K, W^2_V}
◮ Full sharing: all decoder parameters are shared, giving a single unified model, Θ = {W_E, θ_ENC, θ_DEC}
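For illustration, these strategies can be written as the sets of decoder parameter-name fragments to tie, usable with the build_shared_decoder sketch earlier in these notes. The names assume attention sublayers with separate k_proj/q_proj/v_proj projections and a module called ffn; this naming is an assumption of the sketch, not the authors' code.

```python
# Hypothetical mapping from the sharing strategies above to decoder
# parameter-name fragments (for use with the build_shared_decoder sketch).
# The shared embedding W_E and shared encoder θ_ENC are handled separately.
SHARING_STRATEGIES = {
    "no_sharing":        set(),             # Θ = ∅
    "ffn":               {"ffn"},           # Θ ⊇ {W_L1, W_L2}
    "self_attention":    {"self_attn"},     # Θ ⊇ {W^1_K, W^1_Q, W^1_V, W^1_F}
    "enc_dec_attention": {"enc_dec_attn"},  # Θ ⊇ {W^2_K, W^2_Q, W^2_V, W^2_F}
    "key_query": {"self_attn.k_proj", "self_attn.q_proj",
                  "enc_dec_attn.k_proj", "enc_dec_attn.q_proj"},  # Θ ⊇ {W^1_K, W^1_Q, W^2_K, W^2_Q}
    "key_value": {"self_attn.k_proj", "self_attn.v_proj",
                  "enc_dec_attn.k_proj", "enc_dec_attn.v_proj"},  # Θ ⊇ {W^1_K, W^1_V, W^2_K, W^2_V}
    "full_sharing": {""},  # the empty string matches every name: Θ = all decoder parameters
}
```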
SLIDES 47-51

Dataset

◮ Six language pairs from the TED talks dataset⁴ (https://github.com/neulab/word-embeddings-for-nmt).
◮ The languages belong to different linguistic families:
  • Romanian (Ro) and French (Fr) are Romance languages.
  • German (De) and Dutch (Nl) are Germanic languages.
  • Turkish (Tr, Turkic family) and Japanese (Ja, Japonic family) are unrelated languages.

⁴ When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?, NAACL 2018

SLIDES 52-55

Multilingual Model Training Details

◮ An extra target-language token is prepended to each source sentence.
◮ Training uses mini-batches balanced across the target languages.
◮ We minimize a weighted average cross-entropy loss, where each language pair's weight is proportional to its target-side word count.
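A small sketch of the first and last points, with illustrative function names rather than the authors' code: the target-language token is prepended on the source side, and each language pair's cross-entropy loss is weighted by its share of target-side words.

```python
# Illustrative sketch of two of the training details above.
def prepend_language_token(src_tokens, tgt_lang):
    """E.g. (["hello", "world"], "de") -> ["<2de>", "hello", "world"]."""
    return [f"<2{tgt_lang}>"] + src_tokens

def multilingual_loss(per_pair_loss, per_pair_word_count):
    """Weighted average of per-language-pair cross-entropy losses,
    weighted by each pair's target-side word count."""
    total_words = sum(per_pair_word_count.values())
    return sum(per_pair_loss[pair] * per_pair_word_count[pair] / total_words
               for pair in per_pair_loss)
```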

SLIDES 56-60

Results: Baselines

◮ GNMT model: based on recurrent LSTMs with residual connections and attention
  • 1. GNMT NS: No Sharing
  • 2. GNMT FS: Full Sharing
◮ Transformer NS: separate models for each language pair
◮ Transformer FS: one model for all language pairs

SLIDES 61-62

Results: Target languages are from the same family

[Figure: BLEU bar charts for En→Ro and En→Fr (the En→Ro+Fr model) and for En→De and En→Nl (the En→De+Nl model), comparing GNMT NS, GNMT FS, TF NS, and TF FS]

BLEU scores:
◮ GNMT NS ≪ GNMT FS < TF NS ≪ TF FS

SLIDES 63-64

Results: Target languages are from different families

[Figure: BLEU bar charts for En→De and En→Tr (the En→De+Tr model) and for En→De and En→Ja (the En→De+Ja model), comparing GNMT NS, GNMT FS, TF NS, and TF FS]

BLEU scores:
◮ GNMT NS ≪ GNMT FS ≲ TF NS
◮ TF NS ≥ TF FS for En→De+Tr
◮ TF NS ≈ TF FS for En→De+Ja

SLIDES 65-66

Results: Transformer Partial Sharing, Θ = {W_E}

[Figure: BLEU bar charts for En→Ro+Fr and En→De+Nl (same family) and for En→De+Tr and En→De+Ja (different families), comparing GNMT NS, GNMT FS, TF NS, TF FS, and TF PS]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr; TF FS ≈ TF PS for En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 67-68

Results: Transformer Partial Sharing, Θ = {W_E} + {θ_ENC}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 69-70

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W_L1, W_L2}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr and En→De+Ja

SLIDES 71-72

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^1_K, W^1_Q, W^1_V, W^1_F}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr; TF FS ≈ TF PS for En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 73-74

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^2_K, W^2_Q, W^2_V, W^2_F}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS ≈ TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr; TF FS ≈ TF PS for En→De+Ja

SLIDES 75-76

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^1_K, W^1_V, W^2_K, W^2_V}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS > TF PS for En→Ro+Fr; TF FS ≈ TF PS for En→De+Nl
◮ Different families: TF FS < TF PS for En→De+Tr and En→De+Ja

SLIDES 77-78

Results: Transformer Partial Sharing, Θ = {W_E, θ_ENC} + {W^1_K, W^1_Q, W^2_K, W^2_Q}

[Figure: BLEU bar charts for the same language pairs and systems as above]

BLEU scores:
◮ Same family: TF FS ≈ TF PS for En→Ro+Fr and En→De+Nl
◮ Different families: TF FS ≪ TF PS for En→De+Tr and En→De+Ja

SLIDES 79-80

Results: Target languages are from the same family

◮ Sharing all parameters leads to the best BLEU scores for En→Ro+Fr.
◮ Sharing only the key and query weights from both decoder attention sublayers leads to the best BLEU scores for En→De+Nl.

SLIDES 81-82

Results: Target languages are from distant families

◮ Sharing all parameters leads to a noticeable drop in BLEU scores for both language pairs considered.
◮ Sharing the key and query parameters results in a large increase in BLEU scores.

SLIDES 83-89

Conclusions

◮ We explore parameter sharing strategies for multilingual translation with self-attentional (Transformer) models.
◮ We examine cases where the target languages come from the same or from distant language families.
◮ The popular approach of full parameter sharing may perform well only when the target languages belong to the same family.
◮ Partial sharing of the embedding, encoder, and decoder key/query weights is applicable to all kinds of language pairs.
◮ Partial parameter sharing achieves the best BLEU scores when the target languages are from distant families.

Code: https://github.com/DevSinghSachan/multilingual_nmt

Thank you! Questions?