SLIDE 1

Multimodal Machine Translation with Embedding Prediction

Tosho Hirasawa, Hayahide Yamagishi, Yukio Matsumura, Mamoru Komachi hirasawa-tosho@ed.tmu.ac.jp Tokyo Metropolitan University NAACL SRW 2019

SLIDE 2

Multimodal Machine Translation

  • Practical application of machine translation
  • Translate a source sentence along with related nonlinguistic information
    • Visual information

Example (English / French caption pair):
  EN: two young girls are sitting on the street eating corn .
  FR: deux jeunes filles sont assises dans la rue , mangeant du maïs .

SLIDE 3

Issues of MMT

  • Multi30k [Elliott et al., 2016] has only a small amount of data
  • Statistics of the training data (table below)
  • Rare word translations are hard to learn
  • Models tend to output synonyms, guided by the language model

           Sentences  Tokens   Types
  English  29,000     377,534  10,210
  French   29,000     409,845  11,219

  Source:     deux jeunes filles sont assises dans la rue , mangeant du maïs .
  Reference:  two young girls are sitting on the street eating corn .
  NMT:        two young girls are sitting on the street eating food .

SLIDE 4

Previous Solutions

  • Parallel corpus without images [Elliott and Kádár, 2017; Grönroos et al., 2018]
    • Out-of-domain data
    • Pseudo in-domain data obtained by filtering general-domain data
  • Pseudo-parallel corpus [Sennrich et al., 2016; Helcl et al., 2018]
    • Back-translation of caption/monolingual data
  • Monolingual data
    • Pretrained word embeddings: seldom studied

SLIDE 5

Motivation

  • Introduce pretrained word embeddings to MMT
  • Improve rare word translation in MMT
  • Pretrained word embeddings with conventional MMT? See our paper at MT Summit 2019 (https://arxiv.org/abs/1905.10464)!
  • Pretrained word embeddings in text-only NMT
    • Initialize embedding layers in the encoder/decoder [Qi et al., 2018]
      ✓ Improves overall performance in low-resource domains
    • Search-based decoder with continuous output [Kumar and Tsvetkov, 2019]
      ✓ Improves rare word translation

SLIDE 6
  • 1. Multimodal Machine Translation
  • 2. MMT with Embedding Prediction
  • 3. Pretrained Word Embedding
  • 4. Results & Conclusion

SLIDE 7

Baseline: IMAGINATION [Elliott and Kádár, 2017]

(Figure: model architecture, with separate paths used while training and while validating/testing.)

Multitask learning: train both the MT task and the shared space learning task to improve the shared encoder.
MT model: Bahdanau et al., 2015.

SLIDE 8

MMT with Embedding Prediction

(Figure: model architecture while training vs. while validating/testing.)

  • 1. Use embedding prediction in the decoder
  • 2. Initialize the embedding layers in the encoder/decoder with pretrained word embeddings
  • 3. Shift the visual features so that their mean becomes a zero vector (see the sketch below)
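A minimal sketch of step 3, assuming the features are stacked row-wise in a numpy array (the helper name is ours):

    import numpy as np

    def center_visual_features(feats: np.ndarray) -> np.ndarray:
        # Shift every visual feature vector so that the mean vector
        # over the training set becomes the zero vector.
        return feats - feats.mean(axis=0)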

SLIDE 9

Embedding Prediction (Continuous Output)

  • i.e., continuous output [Kumar and Tsvetkov, 2019]
  • Predict a word embedding and search for the nearest word (sketch below):
    • 1. Predict the word embedding of the next word.
    • 2. Compute cosine similarities with each word in the pretrained word embedding.
    • 3. Find and output the most similar word as the system output.
  • Keep unchanged: the pretrained word embedding is NOT updated during training.
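A minimal PyTorch sketch of this output step; the linear projection `proj` from the decoder state into embedding space and all names are our assumptions:

    import torch
    import torch.nn.functional as F

    def predict_word(hidden: torch.Tensor, pretrained_emb: torch.Tensor,
                     proj: torch.nn.Linear) -> torch.Tensor:
        # 1. Predict the embedding of the next word from the decoder state.
        e_hat = proj(hidden)                                   # (emb_dim,)
        # 2. Cosine similarity against every row of the frozen pretrained
        #    embedding table of shape (vocab_size, emb_dim).
        sims = F.cosine_similarity(e_hat.unsqueeze(0), pretrained_emb, dim=-1)
        # 3. Output the index of the most similar word.
        return sims.argmax()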

SLIDE 10

Embedding Layer Initialization

  • Initialize the embedding layers with pretrained word embeddings
  • Fine-tune the embedding layer in the encoder
  • DO NOT update the embedding layer in the decoder (sketch below)

(Figure: the encoder embedding is fine-tuned; the decoder embedding is kept unchanged [Qi et al., 2018].)
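A minimal PyTorch sketch of this setup (helper name and tensor arguments are ours):

    import torch
    import torch.nn as nn

    def init_embedding_layers(enc_emb: nn.Embedding, dec_emb: nn.Embedding,
                              src_vecs: torch.Tensor, tgt_vecs: torch.Tensor):
        # Initialize both layers with pretrained vectors.
        enc_emb.weight.data.copy_(src_vecs)   # fine-tuned during training
        dec_emb.weight.data.copy_(tgt_vecs)
        # Freeze the decoder embedding so that predicted embeddings are
        # always compared against the original pretrained vectors.
        dec_emb.weight.requires_grad = False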

SLIDE 11

Loss Function

  • Model loss: interpolation of the two task losses [Elliott and Kádár, 2017]
  • MT task: max-margin with negative sampling [Lazaridou et al., 2015]
  • Shared space learning task: max-margin [Elliott and Kádár, 2017]

Model loss:

$$ J = \lambda\, J_{T} + (1 - \lambda)\, J_{V} $$

MT task (max-margin with negative sampling):

$$ J_{T} = \sum_{j} \max\bigl\{0,\ \gamma - \cos(\hat{e}_{j}, e_{y_{j}}) + \cos(\hat{e}_{j}, e_{n_{j}})\bigr\} $$

where $\hat{e}_{j}$ is the predicted embedding at position $j$, $e_{y_{j}}$ is the pretrained embedding of the reference word, and $e_{n_{j}}$ is that of a negative sample.

Shared space learning task (max-margin):

$$ J_{V} = \sum_{v'} \max\bigl\{0,\ \alpha - \cos(\hat{v}, v) + \cos(\hat{v}, v')\bigr\} $$

where $\hat{v}$ is the visual feature predicted from the shared encoder, $v$ is the feature of the paired image, and $v'$ is that of a contrastive image.
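A minimal PyTorch sketch of the max-margin term with a single negative sample per position (margin value and all names are ours):

    import torch
    import torch.nn.functional as F

    def max_margin_loss(pred: torch.Tensor, pos: torch.Tensor,
                        neg: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
        # pred, pos, neg: (batch, emb_dim) predicted embeddings, reference
        # embeddings, and negative-sample embeddings, respectively.
        pos_sim = F.cosine_similarity(pred, pos, dim=-1)
        neg_sim = F.cosine_similarity(pred, neg, dim=-1)
        # Hinge: push the reference closer than the negative by the margin.
        return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()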
SLIDE 12
  • 1. Multimodal Machine Translation
  • 2. MMT with Embedding Prediction
  • 3. Pretrained Word Embedding
  • 4. Results & Conclusion

SLIDE 13

Hubness Problem [Lazaridou et al., 2015]

  • Certain words (hubs) appear frequently in the neighborhoods of other words (toy demo below)
  • Even for words that have no relationship with the hub at all
  • This prevents the embedding prediction model from finding the correct output word
  • The model incorrectly outputs the hub word instead
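A toy numpy demonstration of hubness (ours, not from the slides): with random unit vectors in a high-dimensional space, a few points already become the nearest neighbour of many others.

    import numpy as np

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 300))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length vectors

    sims = emb @ emb.T                   # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)      # exclude self-similarity
    nearest = sims.argmax(axis=1)        # nearest neighbour of each vector
    counts = np.bincount(nearest, minlength=len(emb))
    # A handful of "hub" vectors appear as the nearest neighbour far more
    # often than the average of 1.
    print(counts.max(), counts.mean())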

SLIDE 14

All-but-the-Top [Mu and Viswanath, 2018]

  • Addresses the hubness problem in other NLP tasks
  • Debiases a pretrained word embedding based on its global bias:
    • 1. Shift all word embeddings so that their mean becomes a zero vector
    • 2. Subtract the top 5 PCA components from each shifted word embedding
  • Applied to the pretrained word embeddings for both encoder and decoder (sketch below)
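A minimal numpy sketch of the two steps (function name and the use of SVD to obtain the PCA directions are ours):

    import numpy as np

    def all_but_the_top(emb: np.ndarray, n_components: int = 5) -> np.ndarray:
        # 1. Shift all embeddings so that their mean vector becomes zero.
        centered = emb - emb.mean(axis=0)
        # 2. Subtract the top principal components; the rows of vt are the
        #    PCA directions of the centered embedding matrix.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        top = vt[:n_components]                     # (n_components, dim)
        return centered - centered @ top.T @ top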

SLIDE 15
  • 1. Multimodal Machine Translation
  • 2. MMT with Embedding Prediction
  • 3. Pretrained Word Embedding
  • 4. Results & Conclusion

SLIDE 16

Implementation & Dataset

  • Implementation
    • Based on nmtpytorch v3.0.0 [Caglayan et al., 2017]
  • Dataset
    • Multi30k (French to English)
    • Pretrained ResNet50 as the visual encoder
  • Pretrained word embedding
    • FastText, trained on Common Crawl and Wikipedia (loading sketch below)
    • https://fasttext.cc/docs/en/crawl-vectors.html


Our code is here: https://github.com/toshohirasawa/nmtpytorch-emb-pred
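A minimal sketch of building an embedding matrix from a fastText .vec file (helper name and the random fallback for out-of-vocabulary words are ours):

    import numpy as np

    def load_fasttext(path: str, vocab: dict, dim: int = 300) -> np.ndarray:
        # Words missing from the fastText file keep a small random vector.
        emb = np.random.normal(scale=0.01, size=(len(vocab), dim))
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the "<num_words> <dim>" header line
            for line in f:
                word, *values = line.rstrip().split(" ")
                if word in vocab:
                    emb[vocab[word]] = np.asarray(values, dtype=np.float64)
        return emb.astype(np.float32)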

SLIDE 17

Hyperparameters

  • Model
    • Dimension of hidden state: 256
    • RNN type: GRU
    • Dimension of word embedding: 300
    • Dimension of shared space: 2048
    • Vocabulary size (French, English): 10,000
  • Training
    • λ (loss interpolation weight): 0.99
    • Optimizer: Adam
    • Learning rate: 0.0004
    • Dropout rate: 0.3

SLIDE 18

Word-level F1-score

F1-score of words by frequency in the training data:

Frequency bin          1      2      3      4      5-9    10-99  100+
Bahdanau et al., 2015  5.48   12.46  19.97  24.65  32.44  49.66  69.66
IMAGINATION            5.63   12.86  16.76  22.74  33.64  51.12  69.98
Ours                   13.59  19.77  28.34  33.64  38.03  52.13  71.24

The low-frequency bins on the left correspond to rare words.

SLIDE 19

Ablation w.r.t. Embedding Layers

  • Fixing the embedding layer in the decoder is essential
  • It keeps the word embeddings in the input and output layers consistent

Encoder    Decoder    Fixed  BLEU   METEOR
FastText   FastText   Yes    53.49  43.89
random     FastText   Yes    53.22  43.83
FastText   random     No     51.53  43.07
random     random     No     51.42  42.77
FastText   FastText   No     51.42  42.88
random     FastText   No     50.72  42.52

Encoder/Decoder: the embedding layer is initialized with random values or FastText word embeddings. Fixed: whether the decoder embedding layer is fixed (Yes) or fine-tuned (No) during training.

SLIDE 20

Overall Performance

  • Our model performs better than the baselines
  • Even better than the baselines with embedding layer initialization

Model                  Validation BLEU  Test BLEU    Test METEOR
Bahdanau et al., 2015  50.83            51.00 ± .37  42.65 ± .12
+ pretrained           52.05            52.33 ± .66  43.42 ± .13
IMAGINATION            51.03            51.18 ± .16  42.80 ± .19
+ pretrained           52.40            52.75 ± .25  43.56 ± .04
Ours                   53.14            53.49 ± .20  43.89 ± .14

Model (+ pretrained): embedding layer initialization and All-but-the-Top debiasing applied.

SLIDE 21

Ablation w.r.t. Visual Features

  • Centering the visual features is required to train our model

Visual features  Validation BLEU  Test BLEU  Test METEOR
Centered         53.14            53.49      43.89
Raw              52.65            53.27      43.91
No               52.97            53.25      43.91

Visual features: the model is trained with centered or raw visual features; "No" is text-only NMT with the embedding prediction model.

SLIDE 22

Conclusion & Future Work

  • MMT with embedding prediction improves ...
    • Rare word translation
    • Overall performance
  • It is essential for the embedding prediction model to ...
    • Fix the embedding layer in the decoder
    • Debias the pretrained word embeddings
    • Center the visual features for multitask learning
  • Future work
    • Better training corpora for embedding learning in the MMT domain
    • Incorporating visual features into contextualized word embeddings

Thank you!

SLIDE 24

Translation Example

Source:         un homme en vélo pédale devant une voûte .
Reference:      a man on a bicycle pedals through an archway .
Text-only NMT:  a man on a bicycle pedal past an arch .
IMAGINATION:    a man on a bicycle pedals outside a monument .
Ours:           a man on a bicycle pedals in front of a archway .

SLIDE 25

Translation Example (long)

Source:         quatre hommes , dont trois portent des kippas , sont assis sur un tapis à motifs bleu et vert olive .
Reference:      four men , three of whom are wearing prayer caps , are sitting on a blue and olive green patterned mat .
Text-only NMT:  four men , three of whom are wearing aprons , are sitting on a blue and green speedo carpet .
IMAGINATION:    four men , three of them are wearing alaska , are sitting on a blue patterned carpet and green green seating .
Ours:           four men , three are wearing these are wearing these are sitting on a blue and green patterned mat .