Cross-linguality and machine translation without bilingual data - PowerPoint PPT Presentation



SLIDE 1

Cross-linguality and machine translation without bilingual data

Eneko Agirre @eagirre

Joint work with: Mikel Artetxe, Gorka Labaka

IXA NLP group - University of the Basque Country (UPV/EHU)

http://ixa.eus

SLIDE 2

Motivation

  • Cross-lingual word representations: word embeddings are key for Natural Language Processing
  • Mapped embeddings represent languages in a single space
  • Depend on seed bilingual dictionaries
  • Exciting results in dictionary induction, transfer learning, cross-lingual applications, interlingual semantic representations

SLIDE 3

Motivation

Our focus: extend mappings to any pair of languages

  • Most language pairs have very few bilingual resources
  • Key research area for wide adoption of NLP tools

SLIDE 4

Motivation

In particular: no bilingual resources at all

  • Unsupervised embedding mappings
  • Unsupervised neural machine translation

SLIDE 5

Overview

[Diagram, built up over slides 5-8: Arabic monolingual corpora and Chinese monolingual corpora yield Arabic and Chinese embeddings; these are combined into bilingual embeddings with no bilingual resource; the bilingual embeddings feed machine translation, bilingual dictionaries, and crosslingual & multilingual applications.]

SLIDE 9

Outline

  • Bilingual embedding mappings
    • Introduction to vector space models (embeddings)
    • Bilingual embedding mappings (AAAI18)
  • Reduced supervision
    • Self-learning, semi-supervised (ACL17)
    • Self-learning, fully unsupervised (ACL18)
    • Conclusions
  • Unsupervised neural machine translation
    • Introduction to NMT
    • From bilingual embeddings to uNMT (ICLR18)
    • Unsupervised statistical MT (EMNLP18)
    • Conclusions


SLIDE 11

Introduction to vector space models

SLIDE 12

Introduction to vector space models

[Diagram, slide 12: two analogous spaces.
Semantic space: words (Apple, Pear, Banana, Calendar, House, Dog, Cat, Cow, Bark, Moo, Meow); meaningful distances; meaningful relations; ~300 dimensions; built by neural networks / linear algebra from co-occurrence counts.
Geographical space: cities (Donostia, Bilbo, Maule, D. Garazi, Baiona, Gasteiz, Iruñea); meaningful distances; meaningful relations; 2 dimensions; built by cartographers from the 3D world.]
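The "meaningful distances" idea above can be made concrete with a minimal numpy sketch. The 3-dimensional vectors below are invented for illustration (real embeddings have ~300 dimensions); the point is only that related words have higher cosine similarity than unrelated ones.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (vectors invented for illustration;
# only the geometry matters, not the particular numbers).
vecs = {
    "dog":      np.array([0.9, 0.1, 0.0]),
    "cat":      np.array([0.8, 0.2, 0.1]),
    "calendar": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: angle-based distance in the semantic space
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Meaningful distances: related words are closer than unrelated ones
assert cosine(vecs["dog"], vecs["cat"]) > cosine(vecs["dog"], vecs["calendar"])
```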

SLIDE 13

Introduction to embedding mappings

[Diagram, built up over slides 13-20: a Basque embedding space and an English embedding space, linked by a seed dictionary: Txakur - Dog, Sagar - Apple, ..., Egutegi - Calendar.]

SLIDE 21

Introduction to embedding mappings

[Diagram, built up over slides 21-24: a linear mapping learned from the seed dictionary projects the Basque space onto the English space (Mikolov et al. 2013b).]
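The Mikolov et al. (2013b) mapping can be sketched in a few lines of numpy. This is a toy stand-in for the diagram above, not the authors' implementation: all sizes and data are invented, with the "Basque" matrix X and the "English" matrix Z row-aligned by the seed dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic seed dictionary: row i of X "translates to" row i of Z.
d, n = 5, 100
M_true = rng.normal(size=(d, d))                 # hidden linear relation
X = rng.normal(size=(n, d))                      # "Basque" seed vectors
Z = X @ M_true + 0.01 * rng.normal(size=(n, d))  # "English" side, plus noise

# Mikolov et al. (2013b): learn the translation matrix W minimizing
# ||XW - Z||^2 over the seed pairs, by ordinary least squares.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# W then maps the whole source vocabulary, not just the seed pairs.
residual = np.abs(X @ W - Z).max()
```

Once learned, translating an unseen source word is just `x @ W` followed by a nearest-neighbour lookup on the target side.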

SLIDE 25

State-of-the-art in supervised mappings (Artetxe et al., AAAI 2018)

  • Uses a seed bilingual dictionary of 5,000 pairs
  • Framework subsuming previous work: learns two mappings WX, WZ as sequences of (optional) linear mappings:
    (opt.) Pre-process
    1. (opt.) Whitening
    2. Orthogonal mapping
    3. (opt.) Re-weighting
    4. (opt.) De-whitening
  • The optional steps, properly combined, bring up to 5 points improvement

SLIDE 26

State-of-the-art in supervised mappings

Two sequences of (optional) linear transformations (built up over slides 26-32):

S0 (opt.) Pre-processing: length normalization, mean centering
S1 (opt.) Whitening: turn covariance matrices into the identity matrix
S2 Orthogonal mapping: map into a shared space (Procrustes)
S3 (opt.) Re-weight each component according to its cross-correlation
S4 (opt.) De-whitening: restore original variance in every direction
S5 (opt.) Dimensionality reduction: keep the first n components only
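The step sequence above can be sketched on synthetic data. This is an illustrative, simplified reading of the framework (it composes S0, S1, S2 and S3 and omits de-whitening S4 and dimensionality reduction S5); all sizes and data are invented, and it is not the vecmap implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, near-isometric "languages": Z is a rotated, slightly noisy
# copy of X, with rows aligned by the seed dictionary.
n, d = 200, 10
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(n, d))
Z = X @ Q + 0.01 * rng.normal(size=(n, d))

def unit_center(M):
    # S0 (opt.): length normalization followed by mean centering
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M - M.mean(axis=0)

def whitener(M):
    # S1 (opt.): linear transform making the covariance matrix the identity
    vals, vecs = np.linalg.eigh(M.T @ M / len(M))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

X0, Z0 = unit_center(X), unit_center(Z)
Wx1, Wz1 = whitener(X0), whitener(Z0)

# S2: orthogonal mapping into a shared space, from the SVD of the
# cross-covariance of the whitened embeddings (Procrustes-style)
U, s, Vt = np.linalg.svd((X0 @ Wx1).T @ (Z0 @ Wz1) / n)

# S3 (opt.): re-weight each component by its cross-correlation
Xm = X0 @ Wx1 @ U @ np.diag(s ** 0.5)
Zm = Z0 @ Wz1 @ Vt.T @ np.diag(s ** 0.5)

# Dictionary pairs should end up as nearest neighbours in the shared space
Xm /= np.linalg.norm(Xm, axis=1, keepdims=True)
Zm /= np.linalg.norm(Zm, axis=1, keepdims=True)
accuracy = ((Xm @ Zm.T).argmax(axis=1) == np.arange(n)).mean()
```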

SLIDE 33

State-of-the-art in supervised mappings

[Table, built up over slides 33-34: previous methods expressed as combinations of the steps S0(l), S0(m), S1, S2, S3, S4(src), S4(trg), S5. Rows: OLS (Mikolov et al. 2013; Shigeto et al. 2015), CCA (Faruqui and Dyer 2014), Orthogonal (Xing et al. 2015; Artetxe et al. 2016; Zhang et al. 2016; Smith et al. 2017), and our method (AAAI18), which combines the most steps.]

SLIDE 35

Evaluating via bilingual dictionary induction

Dataset by Dinu et al. (2015) extended to German, Finnish, Spanish
⇒ Monolingual embeddings (CBOW + negative sampling)
⇒ Seed dictionary: 5,000 word pairs
⇒ Test dictionary: 1,500 pairs (nearest neighbor, P@1)

Results (accuracy, built up over slides 35-41):

Method | EN-IT | EN-DE | EN-FI | EN-ES
Mikolov et al. (2013) | 34.93† | 35.00† | 25.91† | 27.73†
Faruqui and Dyer (2014) | 38.40* | 37.13* | 27.60* | 26.80*
Shigeto et al. (2015) | 41.53† | 43.07† | 31.04† | 33.73†
Dinu et al. (2015) | 37.7 | 38.93* | 29.14* | 30.40*
Lazaridou et al. (2015) | 40.2 | - | - | -
Xing et al. (2015) | 36.87† | 41.27† | 28.23† | 31.20†
Artetxe et al. (2016) | 39.27 | 41.87* | 30.62* | 31.40*
Zhang et al. (2016) | 36.73† | 40.80† | 28.16† | 31.07†
Smith et al. (2017) | 43.1 | 43.33† | 29.42† | 35.13†
Our method (AAAI18) | 45.27 | 44.13 | 32.94 | 36.60

† our publicly available reimplementation
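The P@1 evaluation protocol above is simple to state in code. A minimal sketch (the function name and toy vectors are invented for illustration): translate each test source word by cosine nearest neighbour and count how often the top candidate matches the gold dictionary.

```python
import numpy as np

def precision_at_1(mapped_src, trg_emb, gold):
    """Nearest-neighbour retrieval accuracy (P@1) over a test dictionary.
    gold[i] is the index of the correct translation of source word i."""
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = trg_emb / np.linalg.norm(trg_emb, axis=1, keepdims=True)
    pred = (a @ b.T).argmax(axis=1)  # cosine nearest neighbour
    return float((pred == gold).mean())

# Tiny hand-made example (vectors invented for illustration)
src = np.array([[1.0, 0.0], [0.0, 1.0]])
trg = np.array([[0.0, 0.9], [0.9, 0.1]])
print(precision_at_1(src, trg, gold=np.array([1, 0])))  # 1.0
```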

SLIDE 48

Why does it work?

Languages are (to a large extent) isometric in word embedding space (!)


SLIDE 50

Reducing supervision

(Built up over slides 50-55.)

Previous work, bilingual signal for training:
- parallel corpora
- comparable corpora
- (big) dictionaries

Our work, bilingual signal for training:
- 25 word dictionary
- numerals (1, 2, 3…)
- nothing

SLIDE 56

Self-learning

[Diagram, built up over slides 56-75: starting from monolingual embeddings and a seed dictionary, learn a mapping; use the mapping to induce a better dictionary; learn an even better mapping from that dictionary; and iterate, alternating mapping and dictionary steps.]

The proposed self-learning method. Too good to be true?
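The mapping/dictionary alternation above can be sketched end to end on synthetic data. This is a toy illustration (all sizes, the noise level, and the identity gold dictionary are invented; the mapping step uses orthogonal Procrustes, and it omits the stochastic refinements discussed later), not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic setup: Z is a noisy rotation of X, and the true translation
# of source word i is target word i.
n, d = 300, 10
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Z = X @ Q + 0.05 * rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# Tiny seed dictionary: only the first 25 word pairs
src, trg = np.arange(25), np.arange(25)

for _ in range(5):
    # Mapping step: orthogonal Procrustes fit on the current dictionary
    U, _, Vt = np.linalg.svd(X[src].T @ Z[trg])
    W = U @ Vt
    # Dictionary step: re-induce translations for the FULL vocabulary
    # by nearest neighbour in the mapped space
    src = np.arange(n)
    trg = (X @ W @ Z.T).argmax(axis=1)

accuracy = (trg == np.arange(n)).mean()
```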

SLIDE 76

Semi-supervised experiments (ACL17)

(Built up over slides 76-79.)

  • Given monolingual embeddings plus a seed bilingual dictionary (train dictionary):
    • 25 word pairs
    • Pairs of numerals
  • Induce a bilingual dictionary using self-learning for the full vocabulary
  • Evaluation
    • Compare translations to an existing bilingual dictionary (test dictionary)
    • Accuracy

SLIDE 80

Semi-supervised experiments (ACL17)

[Chart, slide 80: English-Italian word translation induction results.]

SLIDE 81

Why does it work?

[Built up over slides 81-103: the self-learning loop optimizes an implicit objective, derived step by step on the slides (the formula itself is not recoverable from this transcript).]

The implicit objective is independent from the seed dictionary!

So why do we need a seed dictionary? To avoid poor local optima!

SLIDE 104

Next steps

Is there a way we can avoid the seed dictionary?
Would an initial noisy initialization suffice?

SLIDE 105

Unsupervised experiments (ACL18)

SLIDE 106

Unsupervised experiments (ACL18)

(Built up over slides 106-110.)

Initial dictionary:
  • 1. Compute intra-language similarity
  • 2. Words which are translations of each other should have analogous similarity histograms (isometry hypothesis)

It works, but is very weak: accuracy 0.52%
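The initialization idea above can be sketched with synthetic data. This toy version (all sizes invented, exact isometry, no noise) represents each word by the sorted vector of its intra-language similarities rather than a literal histogram; the sorted vector is likewise invariant under any isometry, which is what makes matching possible without any bilingual signal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Perfectly isometric toy "languages": Z = X rotated, so word i in X and
# word i in Z share the same intra-language geometry.
n, d = 50, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Z = X @ Q

def sorted_sim_rows(M):
    # Each word is represented by the SORTED vector of its similarities
    # to all words of its own language: invariant under any isometry.
    return np.sort(M @ M.T, axis=1)

Sx, Sz = sorted_sim_rows(X), sorted_sim_rows(Z)
# Initial dictionary: match words whose similarity distributions agree
init = np.array([((Sz - row) ** 2).sum(axis=1).argmin() for row in Sx])
accuracy = (init == np.arange(n)).mean()
```

With real embeddings the isometry is only approximate, which is why this initialization is weak on its own and needs the self-learning loop on top.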

SLIDE 111

Unsupervised experiments (ACL18)

For self-learning to work we had to add:
  • 1. Stochastic dictionary induction
  • 2. Frequency-based vocabulary cut-off
  • 3. Hubness problem: instead of inducing the dictionary with nearest-neighbour, use CSLS (Lample et al. 2018)
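CSLS, the retrieval criterion named in point 3, can be written in a few lines. A minimal sketch (the 2x2 similarity matrix below is invented to show the hubness effect): each raw similarity is discounted by the average similarity of both words to their k nearest neighbours, penalizing "hub" words that are near everything.

```python
import numpy as np

def csls(sims, k=10):
    """Cross-domain Similarity Local Scaling (Lample et al. 2018).
    sims[i, j] is the similarity between mapped source word i and
    target word j."""
    # Mean similarity of each word to its k nearest cross-lingual neighbours
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_trg = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_trg

# Invented 2x2 example where target word 1 is a hub (close to everyone):
sims = np.array([[0.70, 0.72],
                 [0.10, 0.90]])
print(sims[0].argmax())             # 1: plain nearest neighbour picks the hub
print(csls(sims, k=1)[0].argmax())  # 0: CSLS corrects the choice
```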

SLIDE 112

Unsupervised experiments (ACL18)

  • Dataset by Dinu et al. (2015) extended to German, Finnish, Spanish
⇒ Monolingual embeddings (CBOW + negative sampling)
⇒ Seed dictionary: 5,000 word pairs / 25 word pairs / none
⇒ Test dictionary: 1,500 word pairs

Results (accuracy, built up over slides 112-122). Previous unsupervised work has convergence problems, as also observed by Søgaard et al. (2018):

Supervision | Method | EN-IT | EN-DE | EN-FI | EN-ES
5k dict. | Mikolov et al. (2013) | 34.93† | 35.00† | 25.91† | 27.73†
5k dict. | Artetxe et al. (2016) | 39.27 | 41.87* | 30.62* | 31.40*
5k dict. | Smith et al. (2017) | 43.1 | 43.33† | 29.42† | 35.13†
5k dict. | Our method (AAAI18) | 45.27 | 44.13 | 32.94 | 36.60
25 dict. | Our method (ACL17) | 37.27 | 39.60 | 28.16 | -
None | Zhang et al. (2017) | 0.00 | 0.00 | 0.01 | 0.01
None | Conneau et al. (2018) | 13.55 | 42.15 | 0.38 | 21.23
None | Our method (ACL18) | 48.13 | 48.19 | 32.63 | 37.33

† our publicly available reimplementation

SLIDE 123

Conclusions

slide-124
SLIDE 124

Conclusions

  • Simple self‐learning method to train bilingual embedding mappings
  • Unsupervised matches results of supervised methods!
  • Unsupervised matches results of supervised methods!
  • Implicit optimization objective independent from seed dictionary

185

slide-125
SLIDE 125

Conclusions

  • Simple self‐learning method to train bilingual embedding mappings
  • Unsupervised matches results of supervised methods!
  • Unsupervised matches results of supervised methods!
  • Implicit optimization objective independent from seed dictionary
  • High‐quality dictionaries: manual analysis shows that real accuracy is above 60%, and up to 80% for high‐frequency words

186

slide-126
SLIDE 126

Conclusions

  • Simple self‐learning method to train bilingual embedding mappings
  • Unsupervised matches results of supervised methods!
  • Implicit optimization objective independent from seed dictionary
  • High‐quality dictionaries: manual analysis shows that real accuracy is above 60%, and up to 80% for high‐frequency words

  • Full reproducibility (including datasets): https://github.com/artetxem/vecmap

187

slide-127
SLIDE 127

Conclusions

  • Simple self‐learning method to train bilingual embedding mappings
  • Unsupervised matches results of supervised methods!
  • Implicit optimization objective independent from seed dictionary
  • High‐quality dictionaries: manual analysis shows that real accuracy is above 60%, and up to 80% for high‐frequency words

  • Full reproducibility (including datasets): https://github.com/artetxem/vecmap

  • Shows that languages share “semantic” structure to a large degree
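The self‐learning loop behind these conclusions can be sketched in a few lines: alternate between fitting an orthogonal map on the current dictionary and re‐inducing the dictionary by nearest neighbours in the mapped space. This is a toy NumPy sketch of the idea, not the actual vecmap implementation (which adds CSLS‐style retrieval, stochastic dictionary induction and other details):

```python
import numpy as np

def procrustes(X, Y):
    # Optimal orthogonal map for the current dictionary (closed form).
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

def self_learning(Xs, Xt, seed_dict, iters=5):
    """Alternate between (1) learning a mapping from the current dictionary
    and (2) inducing a better dictionary by nearest neighbours in the
    mapped space. `seed_dict` maps source row -> target row."""
    d = dict(seed_dict)
    for _ in range(iters):
        src = np.fromiter(d.keys(), dtype=int)
        tgt = np.fromiter(d.values(), dtype=int)
        W = procrustes(Xs[src], Xt[tgt])
        sims = (Xs @ W) @ Xt.T          # cosine similarity if rows are unit-norm
        d = {i: int(j) for i, j in enumerate(sims.argmax(axis=1))}
    return W, d

# Toy data: the target space is an exact rotation of the source space, so a
# 4-pair seed dictionary already recovers the full 20-word dictionary.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(20, 4))
Xs /= np.linalg.norm(Xs, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Xt = Xs @ Q
W, d = self_learning(Xs, Xt, {i: i for i in range(4)})
print(d == {i: i for i in range(20)})  # True
```

On real embeddings the two spaces are only approximately isometric, which is why the loop needs many iterations and the robust initialization of the ACL18 paper.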

190

slide-128
SLIDE 128

References: cross‐lingual mappings

  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Generalizing and Improving Bilingual Word Embedding Mappings with a Multi‐Step Framework of Linear Transformations. In AAAI‐2018.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In ACL‐2017.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self‐learning method for fully unsupervised cross‐lingual mappings of word embeddings. In ACL‐2018.

191

slide-129
SLIDE 129

Outline

  • Bilingual embedding mappings
    • Introduction to vector space models (embeddings)
    • Bilingual embedding mappings (AAAI18)
    • Reduced supervision
      • Self‐learning, semi‐supervised (ACL17)
      • Self‐learning, fully unsupervised (ACL18)
    • Conclusions
  • Unsupervised neural machine translation
    • Introduction to NMT
    • From bilingual embeddings to uNMT (ICLR18)
    • Unsupervised statistical MT (EMNLP18)
    • Conclusions

192

slide-130
SLIDE 130

Introduction to (supervised) NMT

193

slide-131
SLIDE 131

Introduction to (supervised) NMT

  • Given pairs of sentences with known translation (x1…xn, y1…ym)

This is my dearest dog </s> Este es mi perro preferido </s>

194

slide-132
SLIDE 132

Introduction to (supervised) NMT

  • Given pairs of sentences with known translation (x1…xn, y1…ym)

This is my dearest dog </s> Este es mi perro preferido </s>

  • Train an encoder based on Recurrent Neural Nets
    ‐ return all hidden states, encoding input x1…xn

195

slide-133
SLIDE 133

Introduction to (supervised) NMT

  • Given pairs of sentences with known translation (x1…xn, y1…ym)

This is my dearest dog </s> Este es mi perro preferido </s>

  • Train an encoder based on Recurrent Neural Nets
    ‐ return all hidden states, encoding input x1…xn
  • Train a decoder based on Recurrent Neural Nets
    ‐ based on hidden states and last word in translation yi-1
    ‐ plus an attention mechanism
    ‐ classifier guesses next word yi

196

slide-134
SLIDE 134

Introduction to (supervised) NMT

  • Given pairs of sentences with known translation (x1…xn, y1…ym)

This is my dearest dog </s> Este es mi perro preferido </s>

  • Train an encoder based on Recurrent Neural Nets
    ‐ return all hidden states, encoding input x1…xn
  • Train a decoder based on Recurrent Neural Nets
    ‐ based on hidden states and last word in translation yi-1
    ‐ plus an attention mechanism
    ‐ classifier guesses next word yi

End‐to‐end training
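The attention step above can be made concrete in a few lines of NumPy. This sketch uses a simple dot‐product score for brevity (additive, Bahdanau‐style scoring is also common in these systems); `dec_state` and `enc_states` stand in for real RNN hidden states:

```python
import numpy as np

def attention(dec_state, enc_states):
    """Score every encoder hidden state against the current decoder state,
    softmax the scores into weights, and return the weighted sum: the
    context vector used when guessing the next word y_i."""
    scores = enc_states @ dec_state          # (n_src,) one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over source positions
    context = weights @ enc_states           # (hidden_dim,)
    return context, weights

# Toy states: 5 source positions, hidden size 8
rng = np.random.default_rng(1)
enc = rng.normal(size=(5, 8))
dec = rng.normal(size=(8,))
ctx, w = attention(dec, enc)
print(round(float(w.sum()), 6))  # 1.0 -- the weights form a distribution over source words
```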

197

slide-135
SLIDE 135

Introduction to (supervised) NMT

198

Source: Wu et al. 2016 (~30 authors; also known as Google NMT)

slide-136
SLIDE 136

Introduction to (supervised) NMT

(Figure: L1 embeddings -> encoder for L1 -> attention -> L2 decoder -> softmax)

199

slide-137
SLIDE 137

Unsupervised neural machine translation

  • Now that we can represent words in two languages in the same embedding space without bilingual dictionaries… what can we do?

200

slide-138
SLIDE 138

Unsupervised neural machine translation

  • Now that we can represent words in two languages in the same embedding space without bilingual dictionaries… what can we do?
  • We change the architecture of the NMT system:
  • Handle both directions together (L1 ‐> L2, L2 ‐> L1)
  • Shared encoder for the two languages (E)
  • Two decoders, one per language (D1, D2)
  • Fixed embeddings

201

slide-139
SLIDE 139

Unsupervised neural machine translation

202

slide-140
SLIDE 140

Unsupervised neural machine translation

203

slide-141
SLIDE 141

Unsupervised neural machine translation

204

slide-142
SLIDE 142

Unsupervised neural machine translation

  • We change the architecture of the NMT system:
  • Handle both directions together (L1 ‐> L2, L2 ‐> L1)
  • Shared encoder for the two languages (E)
  • Two decoders, one per language (D1, D2)
  • Fixed embeddings
  • We change the training regime, mixing mini‐batches:
  • Denoising autoencoder: noisy input in L1, output in the same language (E+D1)
  • Denoising autoencoder: noisy input in L2, output in the same language (E+D2)
  • Backtranslation: input in L1, translate E+D2, translate E+D1, output in L1
  • Backtranslation: input in L2, translate E+D1, translate E+D2, output in L2

205
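The mixed training regime can be sketched as a dispatcher that cycles through the four objectives, one per mini‐batch. All callables here (`encode`, `decode_l1`, `decode_l2`, `noise`, `update`) are placeholders for the real model components, not the undreamt API:

```python
def training_step(batch_l1, batch_l2, step,
                  encode, decode_l1, decode_l2, noise, update):
    """One mini-batch of the mixed regime: cycle through the four
    objectives in turn. The callables stand in for the shared encoder (E),
    the two decoders (D1, D2), the noise model and the optimizer step."""
    objective = step % 4
    if objective == 0:    # denoising autoencoder, L1 -> L1 (E + D1)
        return update(decode_l1(encode(noise(batch_l1))), batch_l1)
    if objective == 1:    # denoising autoencoder, L2 -> L2 (E + D2)
        return update(decode_l2(encode(noise(batch_l2))), batch_l2)
    if objective == 2:    # backtranslation, L1 -> L2 -> L1
        pseudo_l2 = decode_l2(encode(batch_l1))   # translate on the fly
        return update(decode_l1(encode(pseudo_l2)), batch_l1)
    # objective == 3: backtranslation, L2 -> L1 -> L2
    pseudo_l1 = decode_l1(encode(batch_l2))
    return update(decode_l2(encode(pseudo_l1)), batch_l2)

# Smoke test with identity stand-ins for the network components:
ident = lambda x: x
zero_if_equal = lambda pred, gold: 0.0 if pred == gold else 1.0
losses = [training_step(["chien"], ["dog"], s,
                        ident, ident, ident, ident, zero_if_equal)
          for s in range(4)]
print(losses)  # [0.0, 0.0, 0.0, 0.0]
```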
slide-143
SLIDE 143

Unsupervised neural machine translation

Training

206

slide-144
SLIDE 144

Unsupervised neural machine translation

Training

Une fusillade a eu lieu à l’aéroport international de Los Angeles.

207

slide-145
SLIDE 145

Unsupervised neural machine translation

Training

Supervised

Une fusillade a eu lieu à l’aéroport international de Los Angeles.

208

slide-146
SLIDE 146

Unsupervised neural machine translation

Training

Supervised

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
There was a shooting in Los Angeles International Airport.

209

slide-147
SLIDE 147

Unsupervised neural machine translation

Training

Supervised

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
There was a shooting in Los Angeles International Airport.

210

slide-148
SLIDE 148

Unsupervised neural machine translation

Training

Supervised

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
There was a shooting in Los Angeles International Airport.

211

slide-149
SLIDE 149

Unsupervised neural machine translation

Training

Supervised

Une fusillade a eu lieu à l’aéroport international de Los Angeles.

212

slide-150
SLIDE 150

Unsupervised neural machine translation

Training

Supervised | Autoencoder

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
Une fusillade a eu lieu à l’aéroport international de Los Angeles.

213

slide-151
SLIDE 151

Unsupervised neural machine translation

Training

Supervised | Autoencoder

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
Une fusillade a eu lieu à l’aéroport international de Los Angeles.

214

slide-152
SLIDE 152

Unsupervised neural machine translation

Training

Supervised | Denoising Autoencoder

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
Une lieu fusillade a eu à l’aéroport de Los international Angeles.

215

slide-153
SLIDE 153

Unsupervised neural machine translation

Training

Supervised | Denoising Autoencoder

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
Une lieu fusillade a eu à l’aéroport de Los international Angeles.

216

slide-154
SLIDE 154

Unsupervised neural machine translation

Training

Supervised | Denoising Autoencoder

There a shooting was in Airport Los Angeles International.
There was a shooting in Los Angeles International Airport.

217

slide-155
SLIDE 155

Unsupervised neural machine translation

Training

Supervised | Denoising Autoencoder

There a shooting was in Airport Los Angeles International.
There was a shooting in Los Angeles International Airport.
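The word‐scrambling noise shown in these denoising examples can be implemented with a jitter‐and‐sort trick that bounds how far each word can move (a common scheme in the unsupervised‐MT literature; the exact noise model used in the paper may differ):

```python
import random

def add_noise(sentence, k=3, seed=None):
    """Scramble word order locally: each word gets the sort key
    (index + uniform jitter in [0, k+1)), so two words can only swap
    positions if they are at most k words apart in the original."""
    words = sentence.split()
    rnd = random.Random(seed)
    keys = [i + rnd.uniform(0, k + 1) for i in range(len(words))]
    return " ".join(w for _, w in sorted(zip(keys, words)))

print(add_noise("There was a shooting in Los Angeles International Airport", seed=0))
```

With `k=0` the jitter stays below 1, so the sentence comes back unchanged; larger `k` makes the autoencoder work harder to restore fluent order.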

218

slide-156
SLIDE 156

Unsupervised neural machine translation

Training

Supervised | Denoising | Backtranslation

Une fusillade a eu lieu à l’aéroport international de Los Angeles.

219

slide-157
SLIDE 157

Unsupervised neural machine translation

Training

Supervised | Denoising | Backtranslation

Une fusillade a eu lieu à l’aéroport international de Los Angeles.

220

slide-158
SLIDE 158

Unsupervised neural machine translation

Training

Supervised | Denoising | Backtranslation

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
A shooting has had place in airport international of Los Angeles.

221

slide-159
SLIDE 159

Unsupervised neural machine translation

Training

Supervised | Denoising | Backtranslation

Une lieu fusillade a eu à l’aéroport de Los international Angeles.
A shooting has had place in airport international of Los Angeles.

222

slide-160
SLIDE 160

Unsupervised neural machine translation

Training

Supervised | Denoising | Backtranslation

Une fusillade a eu lieu à l’aéroport international de Los Angeles.
Une lieu fusillade a eu à l’aéroport de Los international Angeles.
A shooting has had place in airport international of Los Angeles.

223

slide-161
SLIDE 161

Unsupervised neural machine translation

  • We change the architecture of the NMT system:
  • Handle both directions together (L1 ‐> L2, L2 ‐> L1)
  • Shared encoder for the two languages (E)
  • Two decoders, one per language (D1, D2)
  • Fixed embeddings
  • We change the training regime, mixing mini‐batches:
  • Denoising autoencoder: noisy input in L1, output in the same language (E+D1)
  • Denoising autoencoder: noisy input in L2, output in the same language (E+D2)
  • Backtranslation: input in L1, translate E+D2, translate E+D1, output in L1
  • Backtranslation: input in L2, translate E+D1, translate E+D2, output in L2

224
slide-162
SLIDE 162

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE

Unsupervised NMT

226

slide-163
SLIDE 163

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39

227

slide-164
SLIDE 164

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40

228

slide-165
SLIDE 165

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40
Unsupervised NMT | Proposed (+backtranslation) | 15.56 | 15.13 | 10.21 | 6.55

It works!

230

slide-166
SLIDE 166

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40
Unsupervised NMT | Proposed (+backtranslation) | 15.56 | 15.13 | 10.21 | 6.55
Semi‐supervised NMT | Proposed (full) + 10k parallel | 18.57 | 17.34 | 11.47 | 7.86
Semi‐supervised NMT | Proposed (full) + 100k parallel | 21.81 | 21.74 | 15.24 | 10.95

It can be easily combined with training data (interesting for low‐resource MT)

233

slide-167
SLIDE 167

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40
Unsupervised NMT | Proposed (+backtranslation) | 15.56 | 15.13 | 10.21 | 6.55
Unsupervised NMT | Lample et al. 2018* (same conference!) | 14.31 | 15.06 | ‐ | ‐

*Experimental conditions are not exactly the same

State‐of‐the‐art (not anymore…)

237

slide-168
SLIDE 168

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40
Unsupervised NMT | Proposed (+backtranslation) | 15.56 | 15.13 | 10.21 | 6.55
Unsupervised NMT | Lample et al. 2018 | 14.31 | 15.06 | ‐ | ‐

Lample et al. 2018b (EMNLP):
  • No embedding mappings
  • BPE jointly over monolingual corpora. Fails for less related languages (Russian).
  • Shared decoder for both languages
  • Transformer (instead of LSTM)

238
slide-169
SLIDE 169

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40
Unsupervised NMT | Proposed (+backtranslation) | 15.56 | 15.13 | 10.21 | 6.55
Unsupervised NMT | Lample et al. 2018 | 14.31 | 15.06 | ‐ | ‐
Unsupervised NMT | Lample et al. 2018b | 24.2 | 25.1 | ‐ | ‐

Lample et al. 2018b (EMNLP):
  • No embedding mappings
  • BPE jointly over monolingual corpora. Fails for less related languages (Russian).
  • Shared decoder for both languages
  • Transformer (instead of LSTM)

239
slide-170
SLIDE 170

Unsupervised neural machine translation

Only WMT released data (test and monolingual corpora)

Supervision | Method | FR‐EN | EN‐FR | DE‐EN | EN‐DE
Unsupervised NMT | Baseline (emb. nearest neighbor) | 9.98 | 6.25 | 7.07 | 4.39
Unsupervised NMT | Proposed (denoising) | 7.28 | 5.33 | 3.64 | 2.40
Unsupervised NMT | Proposed (+backtranslation) | 15.56 | 15.13 | 10.21 | 6.55
Unsupervised NMT | Lample et al. 2018 | 14.31 | 15.06 | ‐ | ‐
Unsupervised NMT | Lample et al. 2018b | 24.2 | 25.1 | ‐ | ‐
Unsupervised NMT | Lample and Conneau 2019 | 33.3 | 33.4 | ‐ | ‐

Lample and Conneau 2019 (arXiv):
  • Similar to 2018b
  • Pre‐train encoder and decoder on (masked) language modeling
  • And… larger batch sizes (+6 BLEU!)

240
slide-171
SLIDE 171

Unsupervised statistical machine translation

241

slide-172
SLIDE 172

Unsupervised statistical machine translation

Artetxe et al. 2018b (EMNLP): estimate PBMT parameters
  • Learn monolingual embeddings for bigrams and trigrams
  • Initialize phrase table using prob. estimates from cross‐lingual mappings
  • Unsupervised tuning based on back‐translation
  • Use backtranslation and train reverse PBMT from scratch. Iterate.

242
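The phrase‐table initialization step can be sketched as a row‐wise softmax over cosine similarities between mapped n‐gram embeddings. The `temperature` knob below is an illustrative choice, not the paper's exact estimator:

```python
import numpy as np

def init_phrase_table(src_vecs, tgt_vecs, temperature=0.1):
    """Turn cosine similarities between mapped source/target n-gram
    embeddings into translation probabilities p(tgt | src) via a
    row-wise softmax over the target vocabulary."""
    def unit(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = unit(src_vecs) @ unit(tgt_vecs).T        # cosine similarities
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy phrase embeddings: 3 source n-grams, 4 target n-grams, dim 8
rng = np.random.default_rng(0)
table = init_phrase_table(rng.normal(size=(3, 8)), rng.normal(size=(4, 8)))
print(table.shape, round(float(table[0].sum()), 6))  # (3, 4) 1.0
```

A low temperature concentrates probability mass on the nearest target phrases, which is what the backtranslation iterations then refine.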
slide-173
SLIDE 173

Unsupervised statistical machine translation

Only WMT released data (test and monolingual corpora). WMT14 and WMT16

Supervision | Method | FR‐EN | EN‐FR | DE‐EN (WMT14) | EN‐DE (WMT14) | DE‐EN (WMT16) | EN‐DE (WMT16)
Unsupervised NMT | Artetxe et al. 2018 | 15.56 | 15.13 | 10.21 | 6.55 | ‐ | ‐
Unsupervised NMT | Lample and C. 2019 | 33.3 | 33.4 | ‐ | ‐ | 34.3 | 26.2

Artetxe et al. 2018b (EMNLP): estimate PBMT parameters
  • Learn monolingual embeddings for bigrams and trigrams
  • Initialize phrase table using prob. estimates from cross‐lingual mappings
  • Unsupervised tuning based on back‐translation
  • Use backtranslation and train reverse PBMT from scratch. Iterate.

243
slide-174
SLIDE 174

Unsupervised statistical machine translation

Only WMT released data (test and monolingual corpora). WMT14 and WMT16

Supervision | Method | FR‐EN | EN‐FR | DE‐EN (WMT14) | EN‐DE (WMT14) | DE‐EN (WMT16) | EN‐DE (WMT16)
Unsupervised NMT | Artetxe et al. 2018 | 15.56 | 15.13 | 10.21 | 6.55 | ‐ | ‐
Unsupervised NMT | Lample and C. 2019 | 33.3 | 33.4 | ‐ | ‐ | 34.3 | 26.2
Unsupervised PBMT | Artetxe et al. 2018b | 25.87 | 26.22 | 17.43 | 14.08 | 23.05 | 18.23

Artetxe et al. 2018b (EMNLP): estimate PBMT parameters
  • Learn monolingual embeddings for bigrams and trigrams
  • Initialize phrase table using prob. estimates from cross‐lingual mappings
  • Unsupervised tuning based on back‐translation
  • Use backtranslation and train reverse PBMT from scratch. Iterate.

244
slide-175
SLIDE 175

Unsupervised statistical machine translation

Only WMT released data (test and monolingual corpora). WMT14 and WMT16

Supervision | Method | FR‐EN | EN‐FR | DE‐EN (WMT14) | EN‐DE (WMT14) | DE‐EN (WMT16) | EN‐DE (WMT16)
Unsupervised NMT | Artetxe et al. 2018 | 15.56 | 15.13 | 10.21 | 6.55 | ‐ | ‐
Unsupervised NMT | Lample and C. 2019 | 33.3 | 33.4 | ‐ | ‐ | 34.3 | 26.2
Unsupervised PBMT | Artetxe et al. 2018b | 25.87 | 26.22 | 17.43 | 14.08 | 23.05 | 18.23
Unsupervised PBMT | Lample et al. 2018b | 27.16 | 28.11 | ‐ | ‐ | 22.68 | 17.77

Artetxe et al. 2018b (EMNLP): estimate PBMT parameters
  • Learn monolingual embeddings for bigrams and trigrams
  • Initialize phrase table using prob. estimates from cross‐lingual mappings
  • Unsupervised tuning based on back‐translation
  • Use backtranslation and train reverse PBMT from scratch. Iterate.

245
slide-176
SLIDE 176

Unsupervised statistical machine translation

Only WMT released data (test and monolingual corpora). WMT14 and WMT16

Supervision | Method | FR‐EN | EN‐FR | DE‐EN (WMT14) | EN‐DE (WMT14) | DE‐EN (WMT16) | EN‐DE (WMT16)
Unsupervised NMT | Artetxe et al. 2018 | 15.56 | 15.13 | 10.21 | 6.55 | ‐ | ‐
Unsupervised NMT | Lample and C. 2019 | 33.3 | 33.4 | ‐ | ‐ | 34.3 | 26.2
Unsupervised PBMT | Artetxe et al. 2018b | 25.87 | 26.22 | 17.43 | 14.08 | 23.05 | 18.23
Unsupervised PBMT | Lample et al. 2018b | 27.16 | 28.11 | ‐ | ‐ | 22.68 | 17.77
NMT + PBMT | Lample et al. 2018b | 27.7 | 27.6 | ‐ | ‐ | 25.2 | 20.2
NMT + PBMT | Artetxe et al. SUBM | 33.5 | 36.2 | 27.0 | 22.5 | 34.4 | 26.9

Combinations improve further. UMT is at the level supervised MT was at in 2014.

246

slide-177
SLIDE 177

Unsupervised machine translation

Getting closer to supervised machine translation!

247

Source: (Lample et al. 2018)

slide-178
SLIDE 178

Unsupervised machine translation

Getting closer to supervised machine translation!

248

Source: (Lample et al. 2018)

slide-179
SLIDE 179

Why does it work?

249

slide-180
SLIDE 180

Why does it work?

Early to say… but intuition:

251

slide-181
SLIDE 181

Why does it work?

Early to say… but intuition:

  • Mapped embedding space provides information for k‐best possible translations
  • NMT / PBMT figures out how to best “combine” them
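This intuition can be made concrete: with mapped embeddings, k‐best translation candidates are simply nearest neighbours in the shared space. The vocabulary and vectors below are made up for illustration; `W` is the learned mapping matrix (identity here because the toy "target" vectors already live in the mapped space):

```python
import numpy as np

def k_best_translations(word, src_words, src_vecs, tgt_words, tgt_vecs, W, k=3):
    """Return the k nearest target words (by cosine similarity) of `word`
    after mapping its embedding with W: the candidate translations the
    MT model then learns to combine in context."""
    x = src_vecs[src_words.index(word)] @ W
    x /= np.linalg.norm(x)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    order = np.argsort(-(t @ x))          # highest cosine first
    return [tgt_words[i] for i in order[:k]]

# Hypothetical 2-dimensional toy spaces:
src_words = ["dog", "shooting"]
src_vecs = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt_words = ["perro", "fusillade", "aeropuerto"]
tgt_vecs = np.array([[0.9, 0.2], [0.1, 1.0], [-0.5, 0.4]])
print(k_best_translations("dog", src_words, src_vecs, tgt_words, tgt_vecs,
                          np.eye(2), k=2))  # ['perro', 'fusillade']
```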

252

slide-182
SLIDE 182

Conclusions

  • New research area – unsupervised Machine Translation

The main Machine Translation competition (WMT) now has an unsupervised subtrack.

  • Performance up, 33 BLEU En‐Fr
  • Plenty of margin for improvement
  • Code for replicability:
    https://github.com/artetxem/undreamt
    https://github.com/artetxem/monoses (soon)

260

slide-183
SLIDE 183

References: unsupervised MT

  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Unsupervised Neural Machine Translation. In ICLR‐2018.
  • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. Unsupervised Statistical Machine Translation. In EMNLP‐2018.

261

slide-184
SLIDE 184

Final words

  • Word embeddings key for Natural Language Processing
  • Mappings represent languages in a common space
  • Most language pairs have very few resources
  • New research area: only monolingual resources

262

slide-185
SLIDE 185

Final words

  • Word embeddings key for Natural Language Processing
  • Mappings represent languages in a common space
  • Most language pairs have very few resources
  • New research area: only monolingual resources
  • Cross‐lingual unsupervised mappings enabled breakthroughs in:
    • Bilingual dictionary induction
    • Unsupervised machine translation

263

slide-186
SLIDE 186

Final words

  • Word embeddings key for Natural Language Processing
  • Mappings represent languages in a common space
  • Most language pairs have very few resources
  • New research area: only monolingual resources
  • Cross‐lingual unsupervised mappings enabled breakthroughs in:
    • Bilingual dictionary induction
    • Unsupervised machine translation
  • Unexplored area in its infancy
    • Potential for MT in low‐resource languages and domains
    • Potential for transforming the NLP landscape
    • From monolingual NLP (e.g. English) to multilingual tools
    • Universal sentence representations

264

slide-187
SLIDE 187

Thank you!

@eagirre
http://ixa2.si.ehu.eus/eneko
https://github.com/artetxem/vecmap
https://github.com/artetxem/undreamt
https://github.com/artetxem/monoses

265