SLIDE 1

Multimodal Machine Translation with Embedding Prediction

Tosho Hirasawa, Hayahide Yamagishi, Yukio Matsumura, Mamoru Komachi hirasawa-tosho@ed.tmu.ac.jp Tokyo Metropolitan University NAACL SRW 2019

SLIDE 2

Multimodal Machine Translation

  • Practical application of machine translation
  • Translate a source sentence along with related nonlinguistic information
    • Visual information

Example (English / French caption pair):
  EN: two young girls are sitting on the street eating corn .
  FR: deux jeunes filles sont assises dans la rue , mangeant du maïs .

SLIDE 3

Issues of MMT

  • Multi30k [Elliott et al., 2016] has only a small amount of data
  • Statistics of the training data (table below)
  • Rare word translations are hard to learn
  • Models tend to output synonyms, guided by the language model

           Sentences  Tokens   Types
  English  29,000     377,534  10,210
  French   29,000     409,845  11,219

  Source:     deux jeunes filles sont assises dans la rue , mangeant du maïs .
  Reference:  two young girls are sitting on the street eating corn .
  NMT:        two young girls are sitting on the street eating food .

SLIDE 4

Previous Solutions

  • Parallel corpus without images [Elliott and Kádár, 2017; Grönroos et al., 2018]
    • Out-of-domain data
    • Pseudo in-domain data obtained by filtering general-domain data
  • Pseudo-parallel corpus [Sennrich et al., 2016; Helcl et al., 2018]
    • Back-translation of caption/monolingual data
  • Monolingual data
    • Pretrained word embeddings: seldom studied

SLIDE 5

Motivation

  • Introduce pretrained word embeddings to MMT
  • Improve rare word translation in MMT
  • Pretrained word embeddings with conventional MMT? See our paper at MT Summit 2019 (https://arxiv.org/abs/1905.10464)!
  • Pretrained word embeddings in text-only NMT
    • Initialize embedding layers in the encoder/decoder [Qi et al., 2018]
      ✓ Improves overall performance in low-resource domains
    • Search-based decoder with continuous output [Kumar and Tsvetkov, 2019]
      ✓ Improves rare word translation

SLIDE 6
  • 1. Multimodal Machine Translation
  • 2. MMT with Embedding Prediction
  • 3. Pretrained Word Embedding
  • 4. Results & Conclusion

SLIDE 7

Baseline: IMAGINATION [Elliott and Kádár, 2017]

(Figure: model architecture, with separate paths used while training and while validating/testing.)

Multitask learning: train both the MT task and the shared space learning task to improve the shared encoder.
MT model: Bahdanau et al., 2015.

SLIDE 8

MMT with Embedding Prediction

(Figure: model architecture while training vs. while validating/testing.)

  • 1. Use embedding prediction in the decoder
  • 2. Initialize the embedding layers in the encoder/decoder with pretrained word embeddings
  • 3. Shift the visual features so that their mean becomes a zero vector (see the sketch below)
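A minimal sketch of step 3, assuming the features are stacked row-wise in a numpy array (the helper name is ours):

    import numpy as np

    def center_visual_features(feats: np.ndarray) -> np.ndarray:
        # Shift every visual feature vector so that the mean vector
        # over the training set becomes the zero vector.
        return feats - feats.mean(axis=0)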

SLIDE 9

Embedding Prediction (Continuous Output)

  • i.e., continuous output [Kumar and Tsvetkov, 2019]
  • Predict a word embedding and search for the nearest word (sketch below):
    • 1. Predict the word embedding of the next word.
    • 2. Compute cosine similarities with each word in the pretrained word embedding.
    • 3. Find and output the most similar word as the system output.
  • Keep unchanged: the pretrained word embedding is NOT updated during training.
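A minimal PyTorch sketch of this output step; the linear projection `proj` from the decoder state into embedding space and all names are our assumptions:

    import torch
    import torch.nn.functional as F

    def predict_word(hidden: torch.Tensor, pretrained_emb: torch.Tensor,
                     proj: torch.nn.Linear) -> torch.Tensor:
        # 1. Predict the embedding of the next word from the decoder state.
        e_hat = proj(hidden)                                   # (emb_dim,)
        # 2. Cosine similarity against every row of the frozen pretrained
        #    embedding table of shape (vocab_size, emb_dim).
        sims = F.cosine_similarity(e_hat.unsqueeze(0), pretrained_emb, dim=-1)
        # 3. Output the index of the most similar word.
        return sims.argmax()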

SLIDE 10

Embedding Layer Initialization

  • Initialize the embedding layers with pretrained word embeddings
  • Fine-tune the embedding layer in the encoder
  • DO NOT update the embedding layer in the decoder (sketch below)

(Figure: the encoder embedding is fine-tuned; the decoder embedding is kept unchanged [Qi et al., 2018].)
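A minimal PyTorch sketch of this setup (helper name and tensor arguments are ours):

    import torch
    import torch.nn as nn

    def init_embedding_layers(enc_emb: nn.Embedding, dec_emb: nn.Embedding,
                              src_vecs: torch.Tensor, tgt_vecs: torch.Tensor):
        # Initialize both layers with pretrained vectors.
        enc_emb.weight.data.copy_(src_vecs)   # fine-tuned during training
        dec_emb.weight.data.copy_(tgt_vecs)
        # Freeze the decoder embedding so that predicted embeddings are
        # always compared against the original pretrained vectors.
        dec_emb.weight.requires_grad = False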

SLIDE 11

Loss Function

  • Model loss: interpolation of the two task losses [Elliott and Kádár, 2017]
  • MT task: max-margin with negative sampling [Lazaridou et al., 2015]
  • Shared space learning task: max-margin [Elliott and Kádár, 2017]

Model loss:

$$ J = \lambda\, J_{T} + (1 - \lambda)\, J_{V} $$

MT task (max-margin with negative sampling):

$$ J_{T} = \sum_{j} \max\bigl\{0,\ \gamma - \cos(\hat{e}_{j}, e_{y_{j}}) + \cos(\hat{e}_{j}, e_{n_{j}})\bigr\} $$

where $\hat{e}_{j}$ is the predicted embedding at position $j$, $e_{y_{j}}$ is the pretrained embedding of the reference word, and $e_{n_{j}}$ is that of a negative sample.

Shared space learning task (max-margin):

$$ J_{V} = \sum_{v'} \max\bigl\{0,\ \alpha - \cos(\hat{v}, v) + \cos(\hat{v}, v')\bigr\} $$

where $\hat{v}$ is the visual feature predicted from the shared encoder, $v$ is the feature of the paired image, and $v'$ is that of a contrastive image.
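A minimal PyTorch sketch of the max-margin term with a single negative sample per position (margin value and all names are ours):

    import torch
    import torch.nn.functional as F

    def max_margin_loss(pred: torch.Tensor, pos: torch.Tensor,
                        neg: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
        # pred, pos, neg: (batch, emb_dim) predicted embeddings, reference
        # embeddings, and negative-sample embeddings, respectively.
        pos_sim = F.cosine_similarity(pred, pos, dim=-1)
        neg_sim = F.cosine_similarity(pred, neg, dim=-1)
        # Hinge: push the reference closer than the negative by the margin.
        return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()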
SLIDE 12
  • 1. Multimodal Machine Translation
  • 2. MMT with Embedding Prediction
  • 3. Pretrained Word Embedding
  • 4. Results & Conclusion

SLIDE 13

Hubness Problem [Lazaridou et al., 2015]

  • Certain words (hubs) appear frequently in the neighborhoods of other words (toy demo below)
  • Even for words that have no relationship with the hub at all
  • This prevents the embedding prediction model from finding the correct output word
  • The model incorrectly outputs the hub word instead
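A toy numpy demonstration of hubness (ours, not from the slides): with random unit vectors in a high-dimensional space, a few points already become the nearest neighbour of many others.

    import numpy as np

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 300))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length vectors

    sims = emb @ emb.T                   # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)      # exclude self-similarity
    nearest = sims.argmax(axis=1)        # nearest neighbour of each vector
    counts = np.bincount(nearest, minlength=len(emb))
    # A handful of "hub" vectors appear as the nearest neighbour far more
    # often than the average of 1.
    print(counts.max(), counts.mean())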

SLIDE 14

All-but-the-Top [Mu and Viswanath, 2018]

  • Addresses the hubness problem in other NLP tasks
  • Debiases a pretrained word embedding based on its global bias:
    • 1. Shift all word embeddings so that their mean becomes a zero vector
    • 2. Subtract the top 5 PCA components from each shifted word embedding
  • Applied to the pretrained word embeddings for both encoder and decoder (sketch below)
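A minimal numpy sketch of the two steps (function name and the use of SVD to obtain the PCA directions are ours):

    import numpy as np

    def all_but_the_top(emb: np.ndarray, n_components: int = 5) -> np.ndarray:
        # 1. Shift all embeddings so that their mean vector becomes zero.
        centered = emb - emb.mean(axis=0)
        # 2. Subtract the top principal components; the rows of vt are the
        #    PCA directions of the centered embedding matrix.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        top = vt[:n_components]                     # (n_components, dim)
        return centered - centered @ top.T @ top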

SLIDE 15
  • 1. Multimodal Machine Translation
  • 2. MMT with Embedding Prediction
  • 3. Pretrained Word Embedding
  • 4. Results & Conclusion

SLIDE 16

Implementation & Dataset

  • Implementation
    • Based on nmtpytorch v3.0.0 [Caglayan et al., 2017]
  • Dataset
    • Multi30k (French to English)
    • Pretrained ResNet50 as the visual encoder
  • Pretrained word embedding
    • FastText, trained on Common Crawl and Wikipedia (loading sketch below)
    • https://fasttext.cc/docs/en/crawl-vectors.html


Our code is here: https://github.com/toshohirasawa/nmtpytorch-emb-pred
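A minimal sketch of building an embedding matrix from a fastText .vec file (helper name and the random fallback for out-of-vocabulary words are ours):

    import numpy as np

    def load_fasttext(path: str, vocab: dict, dim: int = 300) -> np.ndarray:
        # Words missing from the fastText file keep a small random vector.
        emb = np.random.normal(scale=0.01, size=(len(vocab), dim))
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the "<num_words> <dim>" header line
            for line in f:
                word, *values = line.rstrip().split(" ")
                if word in vocab:
                    emb[vocab[word]] = np.asarray(values, dtype=np.float64)
        return emb.astype(np.float32)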

SLIDE 17

Hyperparameters

  • Model
    • Dimension of hidden state: 256
    • RNN type: GRU
    • Dimension of word embedding: 300
    • Dimension of shared space: 2048
    • Vocabulary size (French, English): 10,000
  • Training
    • λ (loss interpolation weight): 0.99
    • Optimizer: Adam
    • Learning rate: 0.0004
    • Dropout rate: 0.3

SLIDE 18

Word-level F1-score

F1-score of words by frequency in the training data:

Frequency bin          1      2      3      4      5-9    10-99  100+
Bahdanau et al., 2015  5.48   12.46  19.97  24.65  32.44  49.66  69.66
IMAGINATION            5.63   12.86  16.76  22.74  33.64  51.12  69.98
Ours                   13.59  19.77  28.34  33.64  38.03  52.13  71.24

The low-frequency bins on the left correspond to rare words.

SLIDE 19

Ablation w.r.t. Embedding Layers

  • Fixing the embedding layer in the decoder is essential
  • It keeps the word embeddings in the input and output layers consistent

Encoder    Decoder    Fixed  BLEU   METEOR
FastText   FastText   Yes    53.49  43.89
random     FastText   Yes    53.22  43.83
FastText   random     No     51.53  43.07
random     random     No     51.42  42.77
FastText   FastText   No     51.42  42.88
random     FastText   No     50.72  42.52

Encoder/Decoder: the embedding layer is initialized with random values or FastText word embeddings. Fixed: whether the decoder embedding layer is fixed (Yes) or fine-tuned (No) during training.

SLIDE 20

Overall Performance

  • Our model performs better than the baselines
  • Even better than the baselines with embedding layer initialization

Model                  Validation BLEU  Test BLEU    Test METEOR
Bahdanau et al., 2015  50.83            51.00 ± .37  42.65 ± .12
+ pretrained           52.05            52.33 ± .66  43.42 ± .13
IMAGINATION            51.03            51.18 ± .16  42.80 ± .19
+ pretrained           52.40            52.75 ± .25  43.56 ± .04
Ours                   53.14            53.49 ± .20  43.89 ± .14

Model (+ pretrained): embedding layer initialization and All-but-the-Top debiasing applied.

SLIDE 21

Ablation w.r.t. Visual Features

  • Centering the visual features is required to train our model

Visual features  Validation BLEU  Test BLEU  Test METEOR
Centered         53.14            53.49      43.89
Raw              52.65            53.27      43.91
No               52.97            53.25      43.91

Visual features: the model is trained with centered or raw visual features; "No" is text-only NMT with the embedding prediction model.

SLIDE 22

Conclusion & Future Work

  • MMT with embedding prediction improves ...
    • Rare word translation
    • Overall performance
  • It is essential for the embedding prediction model to ...
    • Fix the embedding layer in the decoder
    • Debias the pretrained word embeddings
    • Center the visual features for multitask learning
  • Future work
    • Better training corpora for embedding learning in the MMT domain
    • Incorporating visual features into contextualized word embeddings

Thank you!

SLIDE 24

Translation Example

Source:         un homme en vélo pédale devant une voûte .
Reference:      a man on a bicycle pedals through an archway .
Text-only NMT:  a man on a bicycle pedal past an arch .
IMAGINATION:    a man on a bicycle pedals outside a monument .
Ours:           a man on a bicycle pedals in front of a archway .

SLIDE 25

Translation Example (long)

Source:         quatre hommes , dont trois portent des kippas , sont assis sur un tapis à motifs bleu et vert olive .
Reference:      four men , three of whom are wearing prayer caps , are sitting on a blue and olive green patterned mat .
Text-only NMT:  four men , three of whom are wearing aprons , are sitting on a blue and green speedo carpet .
IMAGINATION:    four men , three of them are wearing alaska , are sitting on a blue patterned carpet and green green seating .
Ours:           four men , three are wearing these are wearing these are sitting on a blue and green patterned mat .