

slide-1
SLIDE 1

An Introduction to Neural Machine Translation

  • Prof. John D. Kelleher

@johndkelleher

ADAPT Centre for Digital Content Technology, Dublin Institute of Technology, Ireland

June 25, 2018

1 / 57

slide-2
SLIDE 2

Outline

The Neural Machine Translation Revolution
Neural Networks 101
Word Embeddings
Language Models
Neural Language Models
Neural Machine Translation
Beyond NMT: Image Annotation

slide-3
SLIDE 3

Image from https://blogs.msdn.microsoft.com/translation/

slide-4
SLIDE 4

Image from https://www.blog.google/products/translate

slide-5
SLIDE 5

Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/

slide-6
SLIDE 6

Image from https://slator.com/technology/linguees-founder-launches-deepl-attempt-challenge-google-translate/

slide-7
SLIDE 7

Neural Networks 101

7 / 57

slide-8
SLIDE 8

What is a function?

A function maps a set of inputs (numbers) to an output (number)¹, e.g.:

sum(2, 5, 4) → 11

¹This introduction to neural networks and machine translation is based on Kelleher (2016).

8 / 57

slide-9
SLIDE 9

What is a weightedSum function?

weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]) = (x1 × w1) + (x2 × w2) + · · · + (xm × wm)

where [x1, . . . , xm] are the input numbers and [w1, . . . , wm] are the weights.

weightedSum([3, 9], [−3, 1]) = (3 × −3) + (9 × 1) = −9 + 9 = 0

9 / 57
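A minimal sketch of this function in Python (the snake_case name and the printed example are just illustrative):

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add up the results."""
    return sum(x * w for x, w in zip(inputs, weights))

# Reproduces the slide's example: (3 * -3) + (9 * 1) = 0
print(weighted_sum([3, 9], [-3, 1]))  # 0
```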

slide-10
SLIDE 10

What is an activation function?

An activation function takes the output of our weightedSum function and applies another mapping to it.

10 / 57

slide-11
SLIDE 11

What is an activation function?

[Figure: plots of common activation functions over z]

logistic(z) = 1 / (1 + e^−z)
tanh(z) = (e^z − e^−z) / (e^z + e^−z)
rectifier(z) = max(0, z)

11 / 57
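The three activation functions shown in the figure can be written directly as small Python functions; a sketch using only the standard math module:

```python
import math

def logistic(z):
    """logistic(z) = 1 / (1 + e^-z): squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """tanh(z) = (e^z - e^-z) / (e^z + e^-z): squashes z into the range (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def rectifier(z):
    """rectifier(z) = max(0, z), also known as the ReLU."""
    return max(0.0, z)

print(logistic(0), tanh(0), rectifier(-2))  # 0.5 0.0 0.0
```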

slide-12
SLIDE 12

What is an activation function?

activation = logistic(weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]))

where [x1, . . . , xm] are the input numbers and [w1, . . . , wm] are the weights.

logistic(weightedSum([3, 9], [−3, 1])) = logistic((3 × −3) + (9 × 1)) = logistic(−9 + 9) = logistic(0) = 0.5

12 / 57

slide-13
SLIDE 13

What is a Neuron?

The simple list of operations that we have just described defines the fundamental building block of a neural network: the Neuron.

Neuron = activation(weightedSum([x1, x2, . . . , xm], [w1, w2, . . . , wm]))

where [x1, . . . , xm] are the input numbers and [w1, . . . , wm] are the weights.

13 / 57
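Combining the two ideas gives a one-line neuron; a sketch that assumes the weighted_sum and logistic helpers sketched on the earlier slides:

```python
def neuron(inputs, weights):
    """A neuron: an activation function applied to a weighted sum of the inputs."""
    return logistic(weighted_sum(inputs, weights))

# Reproduces the earlier example: logistic(0) = 0.5
print(neuron([3, 9], [-3, 1]))  # 0.5
```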

slide-14
SLIDE 14

What is a Neuron?

[Figure: a single neuron; the inputs x0, x1, x2, x3, . . . , xm are multiplied by the weights w0, w1, w2, w3, . . . , wm, summed (Σ), and passed through the activation ϕ]

14 / 57

slide-15
SLIDE 15

What is a Neural Network?

[Figure: a neural network with an input layer, three hidden layers, and an output layer]

15 / 57

slide-16
SLIDE 16

Training a Neural Network

◮ We train a neural network by iteratively updating the weights.
◮ We start by randomly assigning weights to each edge.
◮ We then show the network examples of inputs and expected outputs, and update the weights using Backpropagation so that the network outputs match the expected outputs.
◮ We keep updating the weights until the network is working the way we want (a toy sketch of this loop follows below).

16 / 57
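The sketch below illustrates the idea on the smallest possible case: a single logistic neuron fitted by gradient descent on a made-up task. Full backpropagation applies the same kind of weight update layer by layer; the data, learning rate, and epoch count here are arbitrary, and weighted_sum and logistic are the helpers sketched earlier:

```python
import random

def train_neuron(examples, num_inputs, learning_rate=0.1, epochs=1000):
    """Fit one logistic neuron by repeatedly nudging its weights towards the expected outputs."""
    weights = [random.uniform(-1, 1) for _ in range(num_inputs)]  # start with random weights
    for _ in range(epochs):
        for inputs, expected in examples:
            output = logistic(weighted_sum(inputs, weights))
            error = expected - output
            # Move each weight in the direction that reduces the squared error
            # (error * output * (1 - output) is the gradient through the logistic).
            for i in range(num_inputs):
                weights[i] += learning_rate * error * output * (1 - output) * inputs[i]
    return weights

# Toy task: output 1 when the second input is larger than the first.
data = [([0.1, 0.9], 1), ([0.8, 0.2], 0), ([0.3, 0.7], 1), ([0.9, 0.4], 0)]
print(train_neuron(data, num_inputs=2))
```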

slide-17
SLIDE 17

Word Embeddings

17 / 57

slide-18
SLIDE 18

Word Embeddings

◮ Language is sequential and has lots of words.

18 / 57

slide-19
SLIDE 19

“a word is characterized by the company it keeps”

— Firth, 1957

19 / 57

slide-20
SLIDE 20

Word Embeddings

1. Train a network to predict the word that is missing from the middle of an n-gram (or predict the n-gram from the word).
2. Use the trained network weights to represent the word in vector space (a small sketch follows below).

20 / 57
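Step 2 boils down to reading rows out of a learned weight matrix; a minimal numpy sketch in which the vocabulary, dimensionality, and matrix values are all placeholders (in practice W would come from training the prediction network in step 1):

```python
import numpy as np

vocab = ["king", "man", "woman", "queen"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 4
# Placeholder for the input-to-hidden weight matrix learned in step 1.
W = np.random.randn(len(vocab), embedding_dim)

def embed(word):
    """Return the word's vector: the row of W associated with that word."""
    return W[word_to_index[word]]

print(embed("king"))  # a 4-dimensional vector for "king"
```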

slide-21
SLIDE 21

Word Embeddings

Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:

king = <55, −10, 176, 27>
man = <10, 79, 150, 83>
woman = <15, 74, 159, 106>
queen = <60, −15, 185, 50>

21 / 57

slide-22
SLIDE 22

Word Embeddings

vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)²

²Linguistic Regularities in Continuous Space Word Representations, Mikolov et al. (2013)

22 / 57
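With the toy vectors from the previous slide, the analogy can be checked with cosine similarity; a small numpy sketch (these are the illustrative numbers above, not real trained embeddings):

```python
import numpy as np

vectors = {
    "king":  np.array([55.0, -10.0, 176.0, 27.0]),
    "man":   np.array([10.0, 79.0, 150.0, 83.0]),
    "woman": np.array([15.0, 74.0, 159.0, 106.0]),
    "queen": np.array([60.0, -15.0, 185.0, 50.0]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda word: cosine(target, vectors[word]))
print(best)  # "queen": with these toy numbers the target equals vec(queen) exactly
```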

slide-23
SLIDE 23

Language Models

23 / 57

slide-24
SLIDE 24

Language Models

◮ Language is sequential and has lots of words.

24 / 57

slide-25
SLIDE 25

1,2,?

25 / 57

slide-26
SLIDE 26

[Figure: bar chart of the probability of each digit 1–9 being the next symbol in the sequence]

slide-27
SLIDE 27

Th?

27 / 57

slide-28
SLIDE 28

[Figure: bar chart of the probability of each letter a–z being the next character after “Th”]

slide-29
SLIDE 29

◮ A language model can compute:
  1. the probability of an upcoming symbol: P(wn | w1, . . . , wn−1)
  2. the probability of a sequence of symbols³: P(w1, . . . , wn)

³We can go from 1. to 2. using the Chain Rule of Probability: P(w1, w2, w3) = P(w1) P(w2|w1) P(w3|w1, w2)

29 / 57
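A worked example of the footnote's chain rule with made-up probabilities:

```python
# Made-up probabilities for the three-word sequence "feel the force".
p_w1 = 0.10              # P(w1 = feel)
p_w2_given_w1 = 0.40     # P(w2 = the | w1 = feel)
p_w3_given_w1_w2 = 0.25  # P(w3 = force | w1 = feel, w2 = the)

# Chain rule: P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2
print(p_sequence)  # 0.01 (up to floating-point rounding)
```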

slide-30
SLIDE 30

◮ Language models are useful for machine translation because they help with:
  1. word ordering: P(Yes I can help you) > P(Help you I can yes)⁴
  2. word choice: P(Feel the Force) > P(Eat the Force)

⁴Unless it's Yoda that's speaking.

30 / 57

slide-31
SLIDE 31

Neural Language Models

31 / 57

slide-32
SLIDE 32

Recurrent Neural Networks

A particular type of neural network that is useful for processing sequential data (such as language) is a Recurrent Neural Network.

32 / 57

slide-33
SLIDE 33

Recurrent Neural Networks

Using an RNN we process our sequential data one input at a time. In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input.

33 / 57

slide-34
SLIDE 34

Simple Feed-Forward Network

[Figure: a simple feed-forward network: each input (Input 1, Input 2, Input 3, . . . ) is processed independently through the input layer, hidden layer, and output layer]

34 / 57

slide-35
SLIDE 35

Recurrent Neural Networks

[Figure: a recurrent network: the hidden layer's outputs for the current input are stored in a buffer and fed back into the hidden layer together with the next input]

35 / 57

slide-36
SLIDE 36

Recurrent Neural Networks

[Figure: RNN animation, next time step]

36 / 57

slide-37
SLIDE 37

Recurrent Neural Networks

[Figure: RNN animation, next time step]

37 / 57

slide-38
SLIDE 38

Recurrent Neural Networks

[Figure: RNN animation, next time step]

38 / 57

slide-39
SLIDE 39

Recurrent Neural Networks

[Figure: RNN animation, next time step]

39 / 57

slide-40
SLIDE 40

Recurrent Neural Networks

[Figure: RNN animation, next time step]

40 / 57

slide-41
SLIDE 41

Recurrent Neural Networks

[Figure: RNN animation, next time step]

41 / 57

slide-42
SLIDE 42

Figure: Recurrent Neural Network (the input xt and the previous hidden state ht−1 feed the hidden state ht, which produces the output yt)

ht = φ((Whh · ht−1) + (Wxh · xt))
yt = φ(Why · ht)

42 / 57
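A numpy sketch of these two equations, using tanh for the activation φ and random placeholder weights and sizes:

```python
import numpy as np

input_size, hidden_size, output_size = 4, 3, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # previous hidden -> hidden
W_hy = rng.normal(size=(output_size, hidden_size))  # hidden -> output

def rnn_step(x_t, h_prev):
    """One time step: ht = tanh(Whh.h(t-1) + Wxh.xt), yt = tanh(Why.ht)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = np.tanh(W_hy @ h_t)
    return h_t, y_t

# Process a short sequence one input at a time, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x_t in [rng.normal(size=input_size) for _ in range(3)]:
    h, y = rnn_step(x_t, h)
    print(y)
```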

slide-43
SLIDE 43

Recurrent Neural Networks

[Figure: the inputs x1, x2, x3, . . . , xt, xt+1 produce hidden states h1, h2, h3, . . . , ht, ht+1, which in turn produce the outputs y1, y2, y3, . . . , yt, yt+1]

Figure: RNN Unrolled Through Time

43 / 57

slide-44
SLIDE 44

Hallucinating Text

[Figure: starting from Word1, the network's prediction ∗Word2 is fed back in as the next input, producing ∗Word3, ∗Word4, . . . , ∗Wordt+1]

44 / 57
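A sketch of the hallucination loop itself. It assumes a trained language model object with hypothetical initial_state() and step(word, state) methods that return a probability distribution over the vocabulary; those method names are placeholders, not a real library API:

```python
import numpy as np

def hallucinate(model, vocab, first_word, length=20, seed=0):
    """Generate text by sampling a word at each step and feeding it back in as the next input."""
    rng = np.random.default_rng(seed)
    state = model.initial_state()               # hypothetical: the RNN's starting hidden state
    word, output = first_word, [first_word]
    for _ in range(length):
        probs, state = model.step(word, state)  # hypothetical: P(next word) and the new state
        word = rng.choice(vocab, p=probs)       # sample *Word(t+1) from the distribution
        output.append(str(word))
    return " ".join(output)
```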

slide-45
SLIDE 45

Hallucinating Shakespeare

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

45 / 57

slide-46
SLIDE 46

Neural Machine Translation

46 / 57

slide-47
SLIDE 47

Neural Machine Translation

1. RNN Encoders
2. RNN Language Models

47 / 57

slide-48
SLIDE 48

Encoders

[Figure: the encoder reads Word1, Word2, . . . , Wordm, <eos> into hidden states h1, h2, . . . , hm; the final hidden state is the encoding C]

Figure: Using an RNN to Generate an Encoding of a Word Sequence

48 / 57

slide-49
SLIDE 49

Language Models

[Figure: at each step the language model takes Wordt and the hidden state ht and predicts ∗Wordt+1]

Figure: RNN Language Model Unrolled Through Time

49 / 57

slide-50
SLIDE 50

Decoder

[Figure: seeded with Word1, the language model feeds each predicted word back in as the next input, generating ∗Word2, ∗Word3, ∗Word4, . . . , ∗Wordt+1]

Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence

50 / 57

slide-51
SLIDE 51

Encoder-Decoder Architecture

[Figure: the encoder reads Source1, Source2, . . . , <eos> into hidden states h1, h2, . . . , ending in the encoding C; the decoder states d1, . . . , dn then generate Target1, Target2, . . . , <eos>]

Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture

51 / 57
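A high-level sketch of the architecture in the figure, with hypothetical encoder_step and decoder_step functions standing in for the trained RNNs (greedy decoding, i.e. always taking the most probable word):

```python
import numpy as np

def translate(source_words, encoder_step, decoder_step, target_vocab, max_length=50):
    """Encode the source sentence into C, then decode the target sentence word by word."""
    # Encoder: fold the source words (plus <eos>) into a single vector C.
    h = None
    for word in source_words + ["<eos>"]:
        h = encoder_step(word, h)     # hypothetical: returns the next hidden state
    C = h                             # the final hidden state is the encoding

    # Decoder: an RNN language model seeded with C and fed its own predictions.
    translation, word, d = [], "<start>", C
    for _ in range(max_length):
        probs, d = decoder_step(word, d)            # hypothetical: distribution + new state
        word = target_vocab[int(np.argmax(probs))]  # greedy: pick the most probable word
        if word == "<eos>":
            break
        translation.append(word)
    return translation
```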

slide-52
SLIDE 52

Neural Machine Translation

[Figure: the encoder reads "Life is beautiful <eos>" into hidden states h1 . . . h4, producing the encoding C; the decoder (states d1, d2, d3, . . . ) generates "La vie est belle <eos>"]

Figure: Example Translation using an Encoder-Decoder Architecture

52 / 57

slide-53
SLIDE 53

Beyond NMT: Image Annotation

53 / 57

slide-54
SLIDE 54

Image Annotation

Image from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al. (2015)

54 / 57

slide-55
SLIDE 55

Thank you for your attention

john.d.kelleher@dit.ie
@johndkelleher
www.machinelearningbook.com
https://ie.linkedin.com/in/johndkelleher

[Book cover: Data Science, John D. Kelleher and Brendan Tierney, The MIT Press Essential Knowledge Series]

Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Books: Kelleher et al. (2015) Kelleher and Tierney (2018)

slide-56
SLIDE 56

References I

Kelleher, J. (2016). Fundamentals of machine learning for neural machine translation. In Translating Europe Forum 2016: Focusing on Translation Technologies. The European Commission Directorate-General for Translation, European Commission. doi: 10.21427/D78012.

Kelleher, J. D., Mac Namee, B., and D’Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Analytics: Algorithms, Worked Examples and Case Studies. MIT Press, Cambridge, MA.

Kelleher, J. D. and Tierney, B. (2018). Data Science. MIT Press.

56 / 57

slide-57
SLIDE 57

References II

Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

57 / 57