Neural Machine Translation

Philipp Koehn 6 October 2020


Language Models

  • Modeling variants

    – feed-forward neural network
    – recurrent neural network
    – long short-term memory neural network

  • May include input context


Feed Forward Neural Language Model

[Figure: feed-forward neural language model. The history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} are embedded (embedding matrix E_w), fed through a feed-forward hidden layer, and a softmax output layer predicts the word w_i.]
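
A minimal numpy sketch of the forward pass of such a model, assuming invented vocabulary, embedding, and hidden-layer sizes and a tanh hidden layer: the four history words are embedded, concatenated, passed through the hidden layer, and a softmax gives a distribution over the next word.

```python
import numpy as np

# Illustrative sizes only; real vocabularies and layers are much larger.
VOCAB, EMB, HIDDEN, HISTORY = 1000, 32, 64, 4

rng = np.random.default_rng(0)
E  = rng.normal(size=(VOCAB, EMB))             # word embedding matrix E_w
W1 = rng.normal(size=(HISTORY * EMB, HIDDEN))  # concatenated embeddings -> hidden layer
W2 = rng.normal(size=(HIDDEN, VOCAB))          # hidden layer -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ff_lm_predict(history_ids):
    """Distribution over w_i given the four history words w_{i-4} ... w_{i-1}."""
    x = np.concatenate([E[w] for w in history_ids])  # embed and concatenate the history
    h = np.tanh(x @ W1)                              # hidden layer
    return softmax(h @ W2)                           # softmax over the vocabulary

p = ff_lm_predict([3, 17, 42, 7])  # arbitrary word ids
print(p.shape, round(float(p.sum()), 3))  # (1000,) 1.0
```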


Recurrent Neural Language Model

[Figure: recurrent neural language model, first step. The start symbol <s> is embedded, fed into the recurrent state, and a softmax predicts the first output word "the".]

  • Predict the first word of a sentence


Recurrent Neural Language Model

[Figure: recurrent neural language model, second step. The predicted word "the" is embedded and fed into the RNN together with the hidden state from the first step; a softmax predicts the second output word "house".]

  • Predict the second word of a sentence
  • Re-use hidden state from first word prediction


Recurrent Neural Language Model

[Figure: recurrent neural language model, third step. After "<s> the house", the softmax over the recurrent state predicts the third output word "is".]

  • Predict the third word of a sentence
  • ... and so on


Recurrent Neural Language Model

[Figure: recurrent neural language model unrolled over a full sentence. Input "<s> the house is big .", with the softmax at each step predicting the next word: "the house is big . </s>".]
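
A minimal numpy sketch of this recurrent language model, with a plain tanh recurrence standing in for the RNN cell and invented sizes; the hidden state from each step is re-used for the next prediction.

```python
import numpy as np

VOCAB, EMB, HIDDEN = 1000, 32, 64  # illustrative sizes
rng = np.random.default_rng(0)
E   = rng.normal(size=(VOCAB, EMB))      # input word embeddings
W_x = rng.normal(size=(EMB, HIDDEN))     # embedding -> recurrent state
W_h = rng.normal(size=(HIDDEN, HIDDEN))  # previous state -> recurrent state
W_o = rng.normal(size=(HIDDEN, VOCAB))   # recurrent state -> output scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(input_ids):
    """Predict each next word of a sentence, starting from <s>."""
    h = np.zeros(HIDDEN)                      # initial recurrent state
    predictions = []
    for x in input_ids:                       # x = <s>, "the", "house", ...
        h = np.tanh(E[x] @ W_x + h @ W_h)     # re-use the hidden state from the previous step
        predictions.append(softmax(h @ W_o))  # t_i: distribution over the next word
    return predictions

dists = rnn_lm([0, 5, 9, 2, 7, 1])  # word ids for "<s> the house is big ."
```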


Recurrent Neural Translation Model

  • We predicted the words of a sentence
  • Why not also predict their translations?


Encoder-Decoder Model

[Figure: encoder-decoder model. A single recurrent network reads the input "the house is big . </s>" and then keeps going, predicting the translation "das Haus ist groß . </s>" word by word.]

  • Obviously madness
  • Proposed by Google (Sutskever et al. 2014)


What is Missing?

  • Alignment of input words to output words

⇒ Solution: attention mechanism


neural translation model with attention


Input Encoding

[Figure: the recurrent neural network language model from the previous slides, applied to the input sentence "<s> the house is big .".]

  • Inspiration: recurrent neural network language model on the input side


Hidden Language Model States

  • This gives us the hidden states

[Figure: chain of RNN hidden states, computed left to right over the input words.]

  • These encode left context for each word
  • Same process in reverse: right context for each word

[Figure: the same chain of RNN hidden states, computed right to left.]


Input Encoder

[Figure: input encoder. Each input word embedding feeds both a left-to-right and a right-to-left recurrent encoder; the two hidden states are concatenated for each word.]

  • Input encoder: concatenate bidirectional RNN states
  • Each word representation includes full left and right sentence context


Encoder: Math

[Figure: the bidirectional input encoder from the previous slide.]

  • Input is sequence of words $x_j$, mapped into embedding space $\bar{E} x_j$

  • Bidirectional recurrent neural networks

$$\overleftarrow{h}_j = f(\overleftarrow{h}_{j+1}, \bar{E} x_j) \qquad \overrightarrow{h}_j = f(\overrightarrow{h}_{j-1}, \bar{E} x_j)$$

  • Various choices for the function f(): feed-forward layer, GRU, LSTM, ...
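
A minimal numpy sketch of these two recurrences, assuming a plain tanh layer for f() and invented dimensions; the two directions are concatenated into one representation per input word, as on the previous slide.

```python
import numpy as np

EMB, HIDDEN = 32, 64  # illustrative sizes
rng = np.random.default_rng(0)
W_fx, W_fh = rng.normal(size=(EMB, HIDDEN)), rng.normal(size=(HIDDEN, HIDDEN))  # left-to-right
W_bx, W_bh = rng.normal(size=(EMB, HIDDEN)), rng.normal(size=(HIDDEN, HIDDEN))  # right-to-left

def encode(embedded):
    """embedded[j] is the word embedding E x_j; returns one vector h_j per input word."""
    n = len(embedded)
    fwd, bwd = [None] * n, [None] * n
    for j in range(n):                     # forward: h_j = f(h_{j-1}, E x_j)
        prev = fwd[j - 1] if j > 0 else np.zeros(HIDDEN)
        fwd[j] = np.tanh(embedded[j] @ W_fx + prev @ W_fh)
    for j in reversed(range(n)):           # backward: h_j = f(h_{j+1}, E x_j)
        nxt = bwd[j + 1] if j + 1 < n else np.zeros(HIDDEN)
        bwd[j] = np.tanh(embedded[j] @ W_bx + nxt @ W_bh)
    # concatenate both directions: h_j = (backward_j, forward_j)
    return [np.concatenate([b, f]) for f, b in zip(fwd, bwd)]

H = encode([rng.normal(size=EMB) for _ in range(6)])  # a 6-word input sentence
print(len(H), H[0].shape)                             # 6 (128,)
```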


Decoder

  • We want to have a recurrent neural network predicting output words

[Figure: decoder. A chain of recurrent decoder states s_i, each followed by a softmax that produces an output word prediction t_i.]


Decoder

  • We want to have a recurrent neural network predicting output words

[Figure: decoder with output word feedback. Each predicted output word is embedded (E y_i) and fed into the next decoder state s_i.]

  • We feed decisions on output words back into the decoder state


Decoder

  • We want to have a recurrent neural network predicting output words

[Figure: decoder with output word feedback and input context. Each decoder state s_i also receives an input context c_i.]

  • We feed decisions on output words back into the decoder state
  • Decoder state is also informed by the input context


More Detail

[Figure: one decoder step in detail. The previous output word (<s>, then "das") is embedded and combined with the previous decoder state and the input context c_i; a softmax over the prediction vector t_i yields the output word y_i.]

  • Decoder is also a recurrent neural network over a sequence of hidden states $s_i$

    $$s_i = f(s_{i-1}, E y_{i-1}, c_i)$$

  • Again, various choices for the function f(): feed-forward layer, GRU, LSTM, ...

  • Output word $y_i$ is selected by computing a vector $t_i$ (same size as vocabulary)

    $$t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)$$

    then finding the highest value in vector $t_i$

  • If we normalize $t_i$, we can view it as a probability distribution over words

  • $E y_i$ is the embedding of the output word $y_i$
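
A minimal numpy sketch of one decoder step following these formulas, with tanh standing in for f() and invented weight shapes; it returns the new state s_i and the prediction vector t_i.

```python
import numpy as np

VOCAB, EMB, HIDDEN, CTX = 1000, 32, 64, 128  # illustrative sizes
rng = np.random.default_rng(0)
E_out = rng.normal(size=(VOCAB, EMB))                                        # output word embeddings E
W_s, W_y, W_c = (rng.normal(size=(d, HIDDEN)) for d in (HIDDEN, EMB, CTX))   # weights of the state update f
U, V, C = (rng.normal(size=(d, HIDDEN)) for d in (HIDDEN, EMB, CTX))
W = rng.normal(size=(HIDDEN, VOCAB))

def decoder_step(s_prev, y_prev, c_i):
    """One decoder step: new hidden state s_i and prediction vector t_i."""
    Ey_prev = E_out[y_prev]                                  # embedding of the previous output word
    s_i = np.tanh(s_prev @ W_s + Ey_prev @ W_y + c_i @ W_c)  # s_i = f(s_{i-1}, E y_{i-1}, c_i)
    t_i = (s_prev @ U + Ey_prev @ V + c_i @ C) @ W           # t_i = W(U s_{i-1} + V E y_{i-1} + C c_i)
    return s_i, t_i

s, t = decoder_step(np.zeros(HIDDEN), 0, np.zeros(CTX))  # start: <s> as previous word
y = int(t.argmax())  # highest value in t_i selects the output word
```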


Attention

[Figure: attention. The decoder state s_i is compared against the bidirectional encoder states h_j by an attention layer producing weights α_ij.]

  • Given what we have generated so far (decoder hidden state)
  • ... which words in the input should we pay attention to (encoder states)?


Attention

[Figure: attention, as on the previous slide.]

  • Given:
    – the previous hidden state of the decoder $s_{i-1}$
    – the representation of input words $h_j = (\overleftarrow{h}_j, \overrightarrow{h}_j)$

  • Predict an alignment probability $a(s_{i-1}, h_j)$ for each input word $j$
    (modeled with a feed-forward neural network layer)


Attention

[Figure: attention, as on the previous slide.]

  • Normalize attention (softmax)

$$\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_k \exp(a(s_{i-1}, h_k))}$$


Attention

[Figure: attention with weighted sum. The normalized attention weights α_ij are used to form a weighted sum of the encoder states, giving the input context c_i.]

  • Relevant input context: weigh input words according to attention:

    $$c_i = \sum_j \alpha_{ij} h_j$$
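
A minimal numpy sketch of the attention computation from the last three slides, assuming a single feed-forward layer (with invented weights W_a, U_a, v_a) for the alignment score a(s_{i-1}, h_j); it returns the normalized weights α_ij and the input context c_i.

```python
import numpy as np

HIDDEN, CTX, ATT = 64, 128, 32  # illustrative sizes
rng = np.random.default_rng(0)
W_a = rng.normal(size=(HIDDEN, ATT))  # scores the decoder state
U_a = rng.normal(size=(CTX, ATT))     # scores an encoder state
v_a = rng.normal(size=ATT)

def attention(s_prev, H):
    """Attention weights alpha_ij and input context c_i for decoder state s_{i-1}."""
    # a(s_{i-1}, h_j): one feed-forward layer per encoder state h_j
    scores = np.array([v_a @ np.tanh(s_prev @ W_a + h @ U_a) for h in H])
    # normalize with a softmax: alpha_ij = exp(a_ij) / sum_k exp(a_ik)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # weighted sum of encoder states: c_i = sum_j alpha_ij h_j
    c_i = sum(a * h for a, h in zip(alpha, H))
    return alpha, c_i

H = [rng.normal(size=CTX) for _ in range(6)]  # encoder states h_j for a 6-word input
alpha, c = attention(np.zeros(HIDDEN), H)
print(round(float(alpha.sum()), 3), c.shape)  # 1.0 (128,)
```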


Attention

[Figure: attention with weighted sum, as on the previous slide; the resulting context feeds into the next decoder state.]

  • Use context to predict next hidden state and output word


training


Comparing Prediction to Correct Word

[Figure: for each output position, the predicted distribution t_i (after the softmax) is compared against the correct output word y_i ("das", "Haus", "ist"), yielding a cost of −log t_i[y_i].]

  • Current model gives some probability $t_i[y_i]$ to the correct word $y_i$
  • We turn this into an error by computing the cross-entropy: $-\log t_i[y_i]$
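
A tiny numeric check of this error computation, with a made-up three-word distribution:

```python
import numpy as np

def word_error(t_i, y_i):
    """Cross-entropy error of the predicted distribution t_i for the correct word y_i."""
    return -np.log(t_i[y_i])

t_i = np.array([0.1, 0.7, 0.2])  # toy distribution over a 3-word vocabulary
print(word_error(t_i, 1))        # -log 0.7, about 0.357
```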


Computation Graph

  • Math behind neural machine translation defines a computation graph
  • Forward and backward computation to compute gradients for model training

[Figure: worked example of a computation graph. An input x is multiplied with W1, summed with b1, and passed through a sigmoid; the result is multiplied with W2, summed with b2, and passed through another sigmoid. The forward-pass values are annotated at each node.]
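
A minimal numpy sketch of such a computation graph evaluated forward and then backward by hand (the chain rule applied node by node); the node structure follows the figure, but the numeric values here are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy graph: y = sigmoid(w2 . sigmoid(W1 x + b1) + b2); all values invented.
x  = np.array([1.0, 0.0])
W1 = np.array([[3.0, 4.0], [2.0, 3.0]]); b1 = np.array([-2.0, -4.0])
w2 = np.array([5.0, -5.0]);              b2 = -2.0

# Forward computation: evaluate each node of the graph in order.
a = W1 @ x + b1            # product, then sum
h = sigmoid(a)             # sigmoid
y = sigmoid(w2 @ h + b2)   # product, sum, sigmoid

# Backward computation: chain rule, node by node, to obtain gradients.
dz2 = y * (1 - y)          # through the output sigmoid
dw2 = dz2 * h              # gradient for w2
db2 = dz2                  # gradient for b2
dh  = dz2 * w2
da  = dh * h * (1 - h)     # through the hidden sigmoid
dW1 = np.outer(da, x)      # gradient for W1
db1 = da                   # gradient for b1
print(y, dW1, db1, dw2, db2)
```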


Unrolled Computation Graph

[Figure: the full neural translation model with attention, unrolled as one computation graph for the sentence pair "the house is big ." → "das Haus ist groß .": input word embeddings, bidirectional encoder, attention and weighted sum, decoder states, softmax predictions t_i, and a per-word error −log t_i[y_i].]


Batching

  • Already large degree of parallelism

    – most computations on vectors, matrices
    – efficient implementations for CPU and GPU

  • Further parallelism by batching

    – processing several sentence pairs at once
    – scalar operation → vector operation
    – vector operation → matrix operation
    – matrix operation → 3d tensor operation

  • Typical batch sizes 50–100 sentence pairs


Batches

  • Sentences have different lengths
  • When batching, fill up unneeded cells in tensors

⇒ A lot of wasted computations


Mini-Batches

  • Sort sentences by length, break up into mini-batches
  • Example: Maxi-batch 1600 sentence pairs, mini-batch 80 sentence pairs
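
A minimal Python sketch of this batching scheme, using the 1600/80 sizes from the example; sentence pairs are grouped into maxi-batches, sorted by source length, split into mini-batches, and padded (with a hypothetical <pad> token) only up to the longest sentence in each mini-batch.

```python
def make_minibatches(corpus, maxi=1600, mini=80, pad="<pad>"):
    """Group sentence pairs into mini-batches of similar length
    (1600 and 80 are the sizes from the example on the slide)."""
    for start in range(0, len(corpus), maxi):
        # take a maxi-batch and sort it by source sentence length
        maxi_batch = sorted(corpus[start:start + maxi], key=lambda pair: len(pair[0]))
        for s in range(0, len(maxi_batch), mini):
            batch = maxi_batch[s:s + mini]
            # pad only up to the longest sentence within this mini-batch
            src_len = max(len(src) for src, _ in batch)
            tgt_len = max(len(tgt) for _, tgt in batch)
            yield [(src + [pad] * (src_len - len(src)),
                    tgt + [pad] * (tgt_len - len(tgt))) for src, tgt in batch]

corpus = [(["the", "house", "is", "big", "."], ["das", "Haus", "ist", "groß", "."]),
          (["hello", "."], ["hallo", "."])]
for batch in make_minibatches(corpus, maxi=2, mini=2):
    print(batch)
```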


Overall Organization of Training

  • Shuffle corpus
  • Break into maxi-batches
  • Break up each maxi-batch into mini-batches
  • Process mini-batch, update parameters
  • Once done, repeat
  • Typically 5-15 epochs needed (passes through entire training corpus)
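
A sketch of this overall loop; make_minibatches is the function from the batching sketch above, and forward_backward / update are stand-ins for the actual gradient computation and parameter update.

```python
import random

def train(corpus, model, epochs=10, maxi=1600, mini=80):
    """Overall training organization; relies on make_minibatches from the
    batching sketch above, with stand-ins for the gradient computation."""
    def forward_backward(model, batch):  # stand-in: unroll the computation graph,
        return {}                        # run forward and backward, return gradients
    def update(model, gradients):        # stand-in: apply the gradients (e.g. SGD, Adam)
        pass
    for epoch in range(epochs):                             # typically 5-15 epochs
        random.shuffle(corpus)                              # shuffle corpus
        for batch in make_minibatches(corpus, maxi, mini):  # maxi-batches, then mini-batches
            update(model, forward_backward(model, batch))   # process mini-batch, update parameters
```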


deeper models


Deeper Models

  • Encoder and decoder are recurrent neural networks
  • We can add additional layers for each step
  • Recall shallow and deep language models

[Figure: three language model architectures. Shallow (embedding, one RNN layer h_t, softmax output y_t), deep stacked (several stacked RNN layers h_{t,1}, h_{t,2}, h_{t,3} per time step), and deep transitional (several RNN transitions on the path from input to output at each time step).]

  • Adding residual connections (short-cuts through deep layers) helps
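
A minimal numpy sketch of the deep stacked variant with residual connections, assuming tanh recurrences and invented, equal layer sizes so the short-cuts can be added directly; each layer keeps its own recurrent state and consumes the output of the layer below.

```python
import numpy as np

EMB = HIDDEN = 64  # equal sizes (invented) so residual short-cuts add directly
LAYERS = 3
rng = np.random.default_rng(0)
W_in = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(LAYERS)]
W_h  = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(LAYERS)]

def stacked_step(x_t, states):
    """One time step of a deep stacked RNN with residual connections."""
    new_states, inp = [], x_t
    for l in range(LAYERS):
        h = np.tanh(inp @ W_in[l] + states[l] @ W_h[l])  # recurrence of layer l
        new_states.append(h)
        inp = h + inp                                    # residual short-cut to the next layer
    return new_states, inp                               # top of the stack feeds the softmax

states = [np.zeros(HIDDEN) for _ in range(LAYERS)]
states, top = stacked_step(rng.normal(size=EMB), states)
```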


Deep Decoder

  • Two ways of adding layers

    – deep transitions: several layers on path to output
    – deeply stacking recurrent neural networks

  • Why not both?

[Figure: deep decoder combining both. Each time step runs two stacked blocks; within each stack, one recurrent transition (fed by the input context c_t or by the stack below) is followed by feed-forward transitions, producing states v_{t,k,1}, v_{t,k,2} and the decoder state s_{t,k} = v_{t,k,3}.]
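
A minimal numpy sketch of one decoder step combining both ideas, loosely following the stack/transition organization in the figure: within each stack, one recurrent transition is followed by feed-forward transitions, and the result is passed up to the next stack. All names and sizes are invented.

```python
import numpy as np

HIDDEN, STACKS, TRANSITIONS = 64, 2, 3  # invented sizes: 2 stacks of 3 transitions each
rng = np.random.default_rng(0)
W_x = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(STACKS)]
W_h = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(STACKS)]
W_t = [[rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(TRANSITIONS - 1)]
       for _ in range(STACKS)]

def deep_decoder_step(x_t, states):
    """One decoder step: each stack runs a recurrent transition, then FF transitions."""
    new_states, inp = [], x_t
    for k in range(STACKS):
        v = np.tanh(inp @ W_x[k] + states[k] @ W_h[k])  # v_{t,k,1}: recurrent transition
        for W in W_t[k]:
            v = np.tanh(v @ W)                          # v_{t,k,2}, v_{t,k,3}: feed-forward transitions
        new_states.append(v)                            # s_{t,k} = v_{t,k,3}
        inp = v                                         # next stack consumes the state below
    return new_states

states = deep_decoder_step(rng.normal(size=HIDDEN), [np.zeros(HIDDEN)] * STACKS)
```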


Deep Encoder

  • Previously proposed encoder already has 2 layers

    – left-to-right recurrent network, to encode left context
    – right-to-left recurrent network, to encode right context

⇒ Third way of adding layers

[Figure: deep encoder. Alternating left-to-right and right-to-left recurrent layers are stacked on top of the input word embeddings, producing encoder states h_{j,1}, h_{j,2}, h_{j,3}, h_{j,4}.]