Zero-Shot Learning for Word Translation: Successes and Failures
Ndapa Nakashole, University of California, San Diego 05 June 2018
Outline
– Introduction
– Successes
– Limitations
2
Zero-shot learning
3
Zero-shot learning: at test time we can encounter an instance x_j ∈ X_test whose corresponding label y_j ∉ Y was not seen at training time. The zero-shot setting occurs in domains with many possible labels.
Zero-shot learning: Unseen labels
4
To deal with labels that have no training data:
– Instead of learning parameters associated with each label y ∈ Y
– Treat it as the problem of learning a single projection function
– The resulting function can then map input vectors to the label space
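A minimal numpy sketch of this idea, under a toy assumption (hypothetical, for illustration only) that inputs are a fixed linear image A of their label vector plus noise: one shared projection W is learned, so an unseen label like "zebra" needs no training data.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_y = 10, 8

# Label embeddings; "zebra" is held out: no training example carries it.
labels = ["horse", "tiger", "dog", "cat", "cow", "sheep", "zebra"]
label_vecs = {l: rng.normal(size=dim_y) for l in labels}
seen = labels[:-1]

# Toy generative assumption (hypothetical): an input is a fixed linear
# image A of its label vector, plus a little noise.
A = rng.normal(size=(dim_y, dim_x))
X = np.stack([label_vecs[l] @ A + 0.01 * rng.normal(size=dim_x)
              for l in seen for _ in range(50)])
Y = np.stack([label_vecs[l] for l in seen for _ in range(50)])

# Learn ONE projection W from inputs to label space, rather than
# per-label parameters -- so unseen labels need no training data.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict(x):
    """Return the label whose vector is most cosine-similar to x @ W;
    this can name a label, like 'zebra', never seen in training."""
    p = x @ W
    return max(labels, key=lambda l: (p @ label_vecs[l])
               / (np.linalg.norm(p) * np.linalg.norm(label_vecs[l])))
```

The point of the sketch: `predict` can return any label with a known vector, seen or not, because no parameters are tied to individual labels.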
Zero-shot Learning: Cross-Modal Mapping
5
Socher et al. 2013
Cross-lingual mapping
6
PT EN
– First, generate monolingual word embeddings for each language, learned from large unlabeled text corpora
– Second, learn to map between the embedding spaces of the different languages
Multilingual word embeddings
7
Creates multilingual word embeddings: similar words are nearby points, regardless of language
EN
PT PT
Uses of multilingual word embeddings:
– Model transfer
– Recently: initializing unsupervised machine translation
Shared vector space
Problem
– Learn a mapping function that projects vectors from the embedding space of one language to another
8
Outline
9
Success
10
Early work & assumptions
11
Concepts have similar geometric arrangements in the vector spaces of different languages (Mikolov et al. 2013). Assumption: the mapping function is linear
Linear Mapping Function
12
M̂ = argmin_M ‖MX − Y‖_F + λ‖M‖
ŷ = argmax_y cos(Mx, y)
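With both norms squared (a standard choice that yields a closed form), this objective can be sketched in numpy; here the toy "translations" Y are just a noisy rotation of the source vectors X, standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 1000

# Columns of X: source-language vectors; columns of Y: vectors of their
# translations. A noisy rotation stands in for real embedding data.
R = np.linalg.qr(rng.normal(size=(d, d)))[0]
X = rng.normal(size=(d, n))
Y = R @ X + 0.01 * rng.normal(size=(d, n))

# With squared norms, the objective
#   min_M ||M X - Y||_F^2 + lam * ||M||_F^2
# has the closed form  M = Y X^T (X X^T + lam I)^{-1}.
lam = 0.1
M = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

def translate(x, targets):
    """y_hat = argmax_y cos(Mx, y) over candidate target columns."""
    mx = M @ x
    sims = (targets.T @ mx) / (np.linalg.norm(targets, axis=0)
                               * np.linalg.norm(mx))
    return int(np.argmax(sims))
```

Retrieval is by cosine over the whole target vocabulary, exactly as in the ŷ rule above.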
Improving accuracy
– Xing et al. 2015, Zhang et al 2016
– Lazaridou et al. 2015
13
Reducing supervision
14
[Figure: pivot-based mapping: a Portuguese vector x_i^(pt) is projected to Spanish as ŷ_i^(es) via W^(pt→es), then to English as ŷ_i^(en) via W^(es→en), or directly via W^(pt→en)]
(…, 2017)
– Start with a small dictionary – Iteratively build it up while learning map function
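The iterative procedure can be sketched as follows. This is a toy where the target space is an exact rotation of the source and only a 20-pair seed dictionary is known; the confidence cutoff and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 40, 500

# Toy: target vectors are a rotation of source vectors, in shuffled order.
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
X = rng.normal(size=(d, n))                  # source vectors (columns)
perm = rng.permutation(n)
Y = (Q @ X)[:, perm]                         # alignment is unknown...
gold = np.argsort(perm)                      # gold[i] = Y-column of source i

dico = {i: int(gold[i]) for i in range(20)}  # small seed dictionary

for _ in range(5):
    # 1. Fit a map on the current dictionary (least squares).
    src = X[:, list(dico)]
    tgt = Y[:, list(dico.values())]
    M = tgt @ np.linalg.pinv(src)
    # 2. Re-translate everything by nearest neighbor under the map...
    MX = M @ X
    sims = (Y.T @ MX) / np.outer(np.linalg.norm(Y, axis=0),
                                 np.linalg.norm(MX, axis=0))
    best, conf = sims.argmax(axis=0), sims.max(axis=0)
    # 3. ...and keep the most confident pairs as the grown dictionary.
    dico = {int(i): int(best[i]) for i in np.argsort(-conf)[:200]}

accuracy = float(np.mean(best == gold))
```

Each round the dictionary grows from confident translations, which in turn sharpens the map, so the tiny seed bootstraps near-complete alignment on this toy.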
No supervision
al., 2017; Conneau et al., 2018)
– Adversarial training
– Discriminator: separate mapped vectors Mx from targets Y
– Generator (learned map): prevent discriminator from succeeding
15
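The adversarial setup can be sketched with manual gradients. This is a deliberately simplified toy, not the published method: a logistic-regression discriminator, and source/target clouds that differ in their means so a linear discriminator has signal (real systems use MLP discriminators and further stabilization tricks):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Unpaired toy data: the target cloud is a rotated copy of the source cloud.
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
mu = rng.normal(size=d)
X = mu[:, None] + 0.1 * rng.normal(size=(d, 2000))        # source
Y = Q @ (mu[:, None] + 0.1 * rng.normal(size=(d, 2000)))  # target

M = np.eye(d)                           # generator: the map being learned
w, b = 0.01 * rng.normal(size=d), 0.0   # discriminator: logistic regression

for step in range(3000):
    x = X[:, rng.integers(2000)]
    y = Y[:, rng.integers(2000)]
    # Discriminator step: push D(Mx) -> 0 (fake) and D(y) -> 1 (real).
    for v, t in ((M @ x, 0.0), (y, 1.0)):
        p = sigmoid(w @ v + b)
        w -= 0.05 * (p - t) * v
        b -= 0.05 * (p - t)
    # Generator step: update M so the discriminator labels Mx as real,
    # i.e. descend  -log D(Mx);  gradient wrt M is (D(Mx)-1) outer(w, x).
    p = sigmoid(w @ (M @ x) + b)
    M -= 0.05 * (p - 1.0) * np.outer(w, x)

# After training, the mapped source mean should sit nearer the target mean.
err = np.linalg.norm(M @ mu - Q @ mu)
```

The two inner steps alternate exactly as in the bullets above: the discriminator tries to separate Mx from Y, the generator updates M to defeat it.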
Success Summary
– However, there’s room for improvement
16
Outline
17
Limitations
Assumptions
– A1. Maps are linear (linearity)
– A2. Embedding spaces are similar (isomorphism)
18
Assumption of Linearity
– Artetxe et al. 2018, Conneau et al. 2018, …, Nakashole 2017, …, Mikolov et al. 2013
– It is unclear to what extent the assumption of linearity holds
– Non-linear maps are currently not SOTA: trying to optimize multi-layer neural networks for this zero-shot learning problem largely fails
19
Testing Linearity
20
Testing Linearity
– A non-linear map can still be approximated by linear maps in small enough neighborhoods
– If the map is linear, local approximations should be identical or similar
– If it is non-linear, local approximations will vary across neighborhoods
21
22
[Figure: a single linear map M projects a vector x (en) to Mx (de)]
23
[Figure: separate neighborhood maps project x0, …, xn (en) to Mx0, …, Mxn (de)]
Neighborhoods in Word Vector Space
– Pick an 'anchor' word and consider all nearby words (cosine similarity ≥ 0.5) to be in its neighborhood
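Extracting such a neighborhood is a short operation over normalized vectors (an illustrative sketch; names are hypothetical):

```python
import numpy as np

def neighborhood(anchor, vocab, threshold=0.5):
    """Indices of all vocabulary vectors (rows of `vocab`) whose cosine
    similarity to the anchor vector is >= threshold."""
    a = anchor / np.linalg.norm(anchor)
    V = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    return np.where(V @ a >= threshold)[0]
```

Lowering the threshold grows the neighborhood, which is exactly the effect shown on the next slide.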
24
[Figure: one neighborhood at increasing similarity thresholds; s = .1: copenhagen, dinosaur, orchids, …; s = .6: antibiotic, …; s = .8: dietary, nutriti…, multivitamins, …]
Neighborhoods: en-de
25
word               cos(x0, xi)
x0:multivitamins      1.0
x1:antibiotic         0.60
x2:disease            0.45
x3:blowflies          0.33
x4:dinosaur           0.24
x5:orchids            0.19
x6:copenhagen         0.11
Neighborhood maps
26
Setting 1: train a single map (MX0)
27
Translation accuracy of the single map Mx0 across neighborhoods:
word               cos(x0, xi)   Mx0
x0:multivitamins      1.0        68.2
x1:antibiotic         0.60       67.3
x2:disease            0.45       59.2
x3:blowflies          0.33       28.4
x4:dinosaur           0.24       14.7
x5:orchids            0.19       19.3
x6:copenhagen         0.11       31.2
Setting 2: a map for every neighborhood (Mxi)
28
Translation accuracy:
word               cos(x0, xi)   Mx0    Mxi     ∆
x0:multivitamins      1.0        68.2   68.2     –
x1:antibiotic         0.60       67.3   72.7    5.4 ↑
x2:disease            0.45       59.2   73.4   14.2 ↑
x3:blowflies          0.33       28.4   73.2   44.8 ↑
x4:dinosaur           0.24       14.7   77.1   62.4 ↑
x5:orchids            0.19       19.3   78.0   58.7 ↑
x6:copenhagen         0.11       31.2   67.4   36.2 ↑
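Setting 2 amounts to fitting an independent ridge map per neighborhood. A schematic sketch (the 'neighborhoods' here are just two disjoint toy samples whose true maps are different rotations, i.e. a globally non-linear relation that is locally linear):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 30

def fit_map(X, Y, lam=0.1):
    """Ridge map M with M @ X ~ Y (columns are paired word vectors)."""
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

# Two toy 'neighborhoods' whose true maps are different rotations.
M1_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
M2_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
X1, X2 = rng.normal(size=(d, 200)), rng.normal(size=(d, 200))
Y1, Y2 = M1_true @ X1, M2_true @ X2

# Setting 2: one map per neighborhood instead of a single global map.
maps = {0: fit_map(X1, Y1), 1: fit_map(X2, Y2)}

def translate(x, neighborhood_id):
    return maps[neighborhood_id] @ x
```

Using the wrong neighborhood's map gives large errors on this toy, mirroring how Mx0's accuracy drops on distant neighborhoods in the table above.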
Testing Linearity Assumption
– If the map is linear, local approximations should be identical or similar
– If it is non-linear, local approximations will vary across neighborhoods
29
Map Similarity
30
cos(M1, M2) = tr(M1ᵀ M2) / √( tr(M1ᵀ M1) tr(M2ᵀ M2) )

word               cos(x0, xi)   cos(Mx0, Mxi)
x0:multivitamins      1.0            1.0
x1:antibiotic         0.60           0.59
x2:disease            0.45           0.31
x3:blowflies          0.33           0.20
x4:dinosaur           0.24           0.14
x5:orchids            0.19           0.20
x6:copenhagen         0.11           0.15
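This similarity is just the cosine between the two matrices flattened into vectors, and is direct to implement:

```python
import numpy as np

def matrix_cos(M1, M2):
    """cos(M1, M2) = tr(M1^T M2) / sqrt(tr(M1^T M1) tr(M2^T M2)),
    i.e. cosine similarity of the matrices viewed as flat vectors."""
    num = np.trace(M1.T @ M2)
    return num / np.sqrt(np.trace(M1.T @ M1) * np.trace(M2.T @ M2))
```

It is scale-invariant (cos(M, 2M) = 1) and zero for matrices with disjoint support, which makes it a reasonable probe of how much two learned maps agree.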
Translate each neighborhood Xi using the single map Mx0
31
Setting 3: train a single global map (M)
32
Translation accuracy:
word               cos(x0, xi)    M     Mx0    Mxi
x0:multivitamins      1.0        58.3   68.2   68.2
x1:antibiotic         0.60       61.1   67.3   72.7
x2:disease            0.45       69.3   59.2   73.4
x3:blowflies          0.33       71.4   28.4   73.2
x4:dinosaur           0.24       63.2   14.7   77.1
x5:orchids            0.19       73.7   19.3   78.0
x6:copenhagen         0.11       38.5   31.2   67.4
33
Linearity Assumption: Summary
– Local maps vary across neighborhoods, by an amount tightly correlated with the distance between the neighborhoods on which they were trained
34
But SOTA achieves remarkable precision
– (…, ICLR 2018)
– BUT only for closely related languages, e.g., EN-ES
– Precision is much lower for distant pairs: ~40% EN-RU, ~30% EN-ZH
35
Assumptions
– A1. Maps are linear (linearity)
– A2. Embedding spaces are similar (isomorphism)
36
Close vs. distant language translation
37
State-of-the-Art
38
                      en-ru   en-zh   en-de   en-es   en-fr
Artetxe et al. 2018   47.93   20.4    70.13   79.6    79.30
Conneau et al. 2018   37.30   30.90   71.30   79.10   78.10
Smith et al. 2017     46.33   39.60   69.20   78.80   78.13
Proposed approach
– Learn neighborhood-sensitive maps
39
Learn neighborhood-sensitive maps
– Currently not SOTA: trying to optimize multi-layer neural networks for this zero-shot learning problem largely fails
40
Jointly discover neighborhoods & translate
– Discover neighborhood structure while learning to translate
41
Reconstructive Neighborhood Discovery
42
D̂, V̂ = argmin_{D,V} ‖X − VD‖²₂
X_F = X Dᵀ
Neighborhoods:
– Reconstruct word vector x_i using a linear combination of K neighborhoods
– D is the dictionary that minimizes reconstruction error (Lee et al. 2007)
Maps
43
ŷᵢ^linear = W x_Fi
hᵢ = σ₁(x_Fi W)
tᵢ = σ₂(x_Fi W_t)
ŷᵢ^nn = tᵢ × hᵢ + (1.0 − tᵢ) × x_Fi

L(θ) = Σᵢ₌₁^m Σ_{j≠i}^k max( 0, γ + d(yᵢ, ŷᵢ^g) − d(y_j, ŷᵢ^g) )
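In numpy, the forward pass and margin loss look roughly like this; σ₁ (tanh), σ₂ (sigmoid), Euclidean d(·,·), and all dimensions are assumptions filled in for illustration, as the slide leaves them unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, gamma = 16, 20, 0.5
sigma1 = np.tanh                             # assumed nonlinearity
sigma2 = lambda z: 1.0 / (1.0 + np.exp(-z))  # assumed sigmoid gate

xF = rng.normal(size=(m, d))         # neighborhood-aware inputs x_Fi
Y = rng.normal(size=(m, d))          # gold target-language vectors y_i
W = 0.1 * rng.normal(size=(d, d))    # transform weights
Wt = 0.1 * rng.normal(size=(d, d))   # gate weights

# Highway-style map: gated mix of a non-linear transform and the input.
H = sigma1(xF @ W)                   # h_i  = sigma_1(x_Fi W)
T = sigma2(xF @ Wt)                  # t_i  = sigma_2(x_Fi W_t)
Y_hat = T * H + (1.0 - T) * xF       # y_hat_i = t_i h_i + (1 - t_i) x_Fi

def loss(Y_hat, Y, gamma):
    """Margin ranking loss: gold y_i must be closer (Euclidean, here)
    to y_hat_i than every other y_j, by at least margin gamma."""
    total = 0.0
    for i in range(len(Y)):
        di = np.linalg.norm(Y[i] - Y_hat[i])
        for j in range(len(Y)):
            if j != i:
                total += max(0.0, gamma + di - np.linalg.norm(Y[j] - Y_hat[i]))
    return total
```

When the prediction already matches the gold vector, every hinge term is inactive and the loss is zero, which is the behavior the margin formulation is designed to reward.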
44
                      en-ru   en-zh   en-de   en-es   en-fr
This work             50.33   43.27   68.50   77.47   76.10
Artetxe et al. 2018   47.93   20.4    70.13   79.6    79.30
Conneau et al. 2018   37.30   30.90   71.30   79.10   78.10
Smith et al. 2017     46.33   39.60   69.20   78.80   78.13
Rare Words
45
Rare vs frequent words: en-pt
46
en-pt          RARE    MUSE
               57.67   72.60
               49.33   71.73
               49.33   72.10
               47.00   77.73
               48.00   72.27
Methods: Artetxe et al. 2018, Lazaridou et al. 2015, …
Neighborhood interpretability
47
Neighborhood 51:  drugs, zonisamide, cocaine, ritalin, hospitalized, pheniprazine, disorientation, focusyn, alfaxalone
Neighborhood 134: criminally, judicature, prosecutory, derogation, restitutionary, derogative, jailable, extradition, sodomy, crimes
Neighborhood 162: chuanyao, chuanyan, zhiang, thanong, qiangbing, pengpeng, nguyan, yuning, liheng
Neighborhood 7:   khoisan, bantu, sepedi, ndebeles, hereros, shona, hutu, witotoan
48
Conclusion
49