

SLIDE 1

CSEP 517 Natural Language Processing

Neural Machine Translation

Luke Zettlemoyer

(Slides adapted from Karthik Narasimhan, Greg Durrett, Chris Manning, Dan Jurafsky)

SLIDE 2
Last time

  • Statistical MT
  • Word-based
  • Phrase-based
  • Syntactic

SLIDE 3

NMT: the biggest success story of NLP Deep Learning

Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016

  • 2014: First seq2seq paper published
  • 2016: Google Translate switches from SMT to NMT
  • This is amazing!
  • SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a handful of engineers in a few months


SLIDE 4

Neural Machine Translation

  • A single neural network is used to translate from source to target
  • Architecture: Encoder-Decoder
  • Two main components:
  • Encoder: Convert source sentence (input) into a vector/matrix
  • Decoder: Convert encoding into a sentence in target language (output)

SLIDE 5

Recall: RNNs

$h_t = g(Wh_{t-1} + Ux_t + b) \in \mathbb{R}^d$
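To make the recurrence concrete, here is a minimal PyTorch sketch of one RNN step; the class name, the choice of tanh for g, and the sizes are illustrative assumptions rather than anything specified on the slide.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """One step of h_t = g(W h_{t-1} + U x_t + b), with g = tanh (illustrative)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)  # W h_{t-1}
        self.U = nn.Linear(input_size, hidden_size, bias=True)    # U x_t + b

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W(h_prev) + self.U(x_t))  # h_t in R^d
```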

SLIDE 6

Sequence to Sequence learning (Seq2seq)

  • Encode entire input sequence into a single vector (using an RNN)
  • Decode one word at a time (again, using an RNN!)
  • Beam search for better inference
  • Learning is not trivial! (vanishing/exploding gradients)

(Sutskever et al., 2014)
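As a rough illustration of the encode-then-decode idea, here is a minimal sketch (single-layer GRUs, teacher forcing, made-up sizes); Sutskever et al. (2014) actually used deep LSTMs, so treat this as a toy stand-in rather than their architecture.

```python
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    # Illustrative sketch only: single-layer GRUs instead of the paper's deep LSTMs.
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the entire input sequence into a single vector (the final hidden state).
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode one word at a time, conditioned on that vector (teacher forcing during training).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_states)  # per-step logits over the target vocabulary
```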

SLIDE 7

Neural Machine Translation (NMT)

[Diagram: an Encoder RNN reads the source sentence (input) “les pauvres sont démunis”; a Decoder RNN then generates the target sentence (output) “the poor don’t have any money <END>”, taking an argmax over the vocabulary at each step.]

Encoder RNN produces an encoding of the source sentence. The encoding of the source sentence provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence conditioned on the encoding.

Note: This diagram shows test-time behavior: the decoder output is fed in as the next step’s input.

SLIDE 8

Seq2seq training

  • Similar to training a language model!
  • Minimize cross-entropy loss:
  • Back-propagate gradients through both decoder and encoder
  • Need a really big corpus

$\sum_{t=1}^{T} -\log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$

English: Machine translation is cool!

36M sentence pairs

Russian: Машинный перевод - это круто!
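A minimal sketch of one training step for this objective, assuming a model like the hypothetical ToySeq2Seq above and padded batches; the helper name and padding convention are assumptions.

```python
import torch.nn.functional as F

def training_step(model, optimizer, src_ids, tgt_in, tgt_out, pad_id):
    # tgt_in = <START> y_1 ... y_{T-1};  tgt_out = y_1 ... y_T (shifted by one).
    logits = model(src_ids, tgt_in)                 # (batch, T, vocab)
    # Cross-entropy = sum_t -log P(y_t | y_1..y_{t-1}, x_1..x_n), averaged over tokens.
    loss = F.cross_entropy(logits.transpose(1, 2), tgt_out, ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()    # gradients flow through both decoder and encoder ("end to end")
    optimizer.step()
    return loss.item()
```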

SLIDE 9

Training a Neural Machine Translation system

[Diagram: the Encoder RNN reads the source sentence “les pauvres sont démunis” (from corpus); the Decoder RNN is fed the target sentence “<START> the poor don’t have any money” (from corpus) and produces predictions $\hat{y}_1, \ldots, \hat{y}_7$, each with a per-step loss $J_1, \ldots, J_7$ (e.g. $J_1$ = negative log prob of “the”, $J_4$ = negative log prob of “have”, $J_7$ = negative log prob of <END>).]

$J = \frac{1}{T} \sum_{t=1}^{T} J_t$

Seq2seq is optimized as a single system. Backpropagation operates “end to end”.

SLIDE 10

Greedy decoding

  • Compute argmax at every step of the decoder to generate the next word

  • What’s wrong?
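A minimal greedy-decoding loop might look like the following sketch (the model interface and helper name are assumptions); the answer to “what’s wrong?” is visible in the code: each argmax is locally optimal and can never be undone.

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids, start_id, end_id, max_len=50):
    ys = [start_id]
    for _ in range(max_len):
        logits = model(src_ids, torch.tensor([ys]))   # (1, t, vocab)
        next_id = logits[0, -1].argmax().item()       # argmax at every step
        ys.append(next_id)
        if next_id == end_id:
            break
    return ys
```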
SLIDE 11

Exhaustive search?

  • Find $\arg\max_{y_1, \ldots, y_T} P(y_1, \ldots, y_T \mid x_1, \ldots, x_n)$
  • Requires computing all possible sequences
  • $O(V^T)$ complexity!
  • Too expensive

SLIDE 12

A middle ground: Beam search

  • Key idea: At every step, keep track of the k most probable partial translations (hypotheses)
  • Score of each hypothesis = its log probability: $\sum_{t=1}^{j} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$
  • Not guaranteed to be optimal
  • More efficient than exhaustive search

SLIDE 13

Beam decoding

(slide credit: Abigail See)

SLIDE 14

Beam decoding

(slide credit: Abigail See)

SLIDE 15

Beam decoding

(slide credit: Abigail See)

SLIDE 16

Beam decoding

  • Different hypotheses may produce the ⟨e⟩ (end) token at different time steps
  • When a hypothesis produces ⟨e⟩, stop expanding it and place it aside
  • Continue beam search until:
  • All k hypotheses produce ⟨e⟩, OR
  • We hit the max decoding limit T
  • Select the top hypotheses using the normalized likelihood score: $\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$
  • Otherwise shorter hypotheses would have higher scores
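Putting the last few slides together, here is a sketch of beam search with the normalized likelihood score (the model interface and helper name are assumptions; real implementations batch the hypotheses rather than looping):

```python
import torch

@torch.no_grad()
def beam_search(model, src_ids, start_id, end_id, k=5, max_len=50):
    beams = [([start_id], 0.0)]        # k most probable partial translations (hypotheses)
    finished = []
    for _ in range(max_len):
        candidates = []
        for ys, score in beams:
            logits = model(src_ids, torch.tensor([ys]))
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_id = log_probs.topk(k)
            for lp, wid in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((ys + [wid], score + lp))   # score = sum of log-probs
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for ys, score in candidates[:k]:
            # Hypotheses that produce the end token are set aside; the rest keep expanding.
            (finished if ys[-1] == end_id else beams).append((ys, score))
        if not beams:
            break
    finished.extend(beams)             # hit the max decoding limit
    # Normalize by length so shorter hypotheses aren't unfairly favored.
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1))[0]
```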

SLIDE 17

NMT vs SMT

Pros

  • Better performance
  • Fluency
  • Longer context
  • Single NN optimized end-to-end
  • Less engineering
  • Works out of the box for many language pairs

Cons

  • Requires more data and compute
  • Less interpretable
  • Hard to debug
  • Uncontrollable
  • Heavily dependent on data - could lead to unwanted biases
  • More parameters
SLIDE 18

How seq2seq changed the MT landscape

SLIDE 19

MT Progress

(source: Rico Sennrich)

SLIDE 20

Versatile seq2seq

  • Seq2seq finds applications in many other tasks!
  • Any task where inputs and outputs are sequences of words/characters
  • Summarization (input text → summary)
  • Dialogue (previous utterance → reply)
  • Parsing (sentence → parse tree in sequence form)
  • Question answering (context + question → answer)

SLIDE 21

Issues with vanilla seq2seq

  • A single encoding vector, $h^{enc}$, needs to capture all the information about the source sentence (this is a bottleneck)
  • Longer sequences can lead to vanishing gradients
  • Overfitting

SLIDE 22

Remember alignments?

SLIDE 23

Attention

  • The neural MT equivalent of alignment models
  • Key idea: At each time step during decoding, focus on a particular part of the source sentence
  • This depends on the decoder’s current hidden state (i.e. a notion of what you are trying to decode)
  • Usually implemented as a probability distribution over the hidden states of the encoder ($h^{enc}_i$)

SLIDE 24

Sequence-to-sequence with attention

[Diagram: attention scores are computed as dot products between the decoder hidden state (at <START>) and each encoder hidden state for the source sentence “les pauvres sont démunis”.]

SLIDE 25

Sequence-to-sequence with attention

[Diagram: same setup; dot-product attention scores between the decoder hidden state and each encoder hidden state.]

SLIDE 26

Sequence-to-sequence with attention

[Diagram: same setup; dot-product attention scores between the decoder hidden state and each encoder hidden state.]

SLIDE 27

Sequence-to-sequence with attention

[Diagram: same setup; dot-product attention scores between the decoder hidden state and each encoder hidden state.]

SLIDE 28

Sequence-to-sequence with attention

[Diagram: take a softmax to turn the attention scores into a probability distribution (the attention distribution). On this decoder timestep, we’re mostly focusing on the first encoder hidden state (“les”).]

SLIDE 29

Sequence-to-sequence with attention

[Diagram: use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.]

SLIDE 30

Sequence-to-sequence with attention

[Diagram: concatenate the attention output with the decoder hidden state, then use it to compute the first prediction $\hat{y}_1$ (“the”) as before.]

SLIDE 31

Sequence-to-sequence with attention

[Diagram: the same attention procedure at the next decoder step produces $\hat{y}_2$ (“poor”).]

SLIDE 32

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_3$ (“don’t”).]

SLIDE 33

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_4$ (“have”).]

SLIDE 34

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_5$ (“any”).]

SLIDE 35

Sequence-to-sequence with attention

[Diagram: the next decoder step produces $\hat{y}_6$ (“money”).]

SLIDE 36

Computing attention

  • Encoder hidden states: $h^{enc}_1, \ldots, h^{enc}_n$
  • Decoder hidden state at time $t$: $h^{dec}_t$
  • First, get attention scores for this time step (we will see what $g$ is soon!): $e^t = [\,g(h^{enc}_1, h^{dec}_t), \ldots, g(h^{enc}_n, h^{dec}_t)\,]$
  • Obtain the attention distribution using softmax: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^n$
  • Compute the weighted sum of encoder hidden states: $a_t = \sum_{i=1}^{n} \alpha^t_i h^{enc}_i \in \mathbb{R}^h$
  • Finally, concatenate with the decoder state and pass on to the output layer: $[a_t; h^{dec}_t] \in \mathbb{R}^{2h}$
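These steps map almost line-for-line onto code. A minimal sketch for a single decoder time step (the scoring function g is passed in; shapes are illustrative):

```python
import torch

def attention_step(h_enc, h_dec_t, g):
    """h_enc: (n, h) encoder hidden states; h_dec_t: (h,) decoder state at time t."""
    e_t = torch.stack([g(h_i, h_dec_t) for h_i in h_enc])   # attention scores e^t
    alpha_t = torch.softmax(e_t, dim=0)                      # attention distribution, in R^n
    a_t = (alpha_t.unsqueeze(1) * h_enc).sum(dim=0)          # weighted sum, in R^h
    return torch.cat([a_t, h_dec_t])                         # [a_t; h_dec_t], in R^{2h}
```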

SLIDE 37

Types of attention

  • Assume encoder hidden states $h_1, h_2, \ldots, h_n$ and decoder hidden state $z$
  • 1. Dot-product attention (assumes equal dimensions for $h_i$ and $z$): $e_i = g(h_i, z) = z^{\top} h_i \in \mathbb{R}$
  • 2. Multiplicative attention: $g(h_i, z) = z^{\top} W h_i \in \mathbb{R}$, where $W$ is a weight matrix
  • 3. Additive attention: $g(h_i, z) = v^{\top} \tanh(W_1 h_i + W_2 z) \in \mathbb{R}$, where $W_1, W_2$ are weight matrices and $v$ is a weight vector
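The three scoring functions above, written as small PyTorch functions (each returns a scalar score; parameter shapes are illustrative assumptions):

```python
import torch

def dot_attention(h_i, z):
    return z @ h_i                               # assumes h_i and z have equal dimension

def multiplicative_attention(h_i, z, W):
    return z @ (W @ h_i)                         # W is a weight matrix

def additive_attention(h_i, z, W1, W2, v):
    return v @ torch.tanh(W1 @ h_i + W2 @ z)     # W1, W2 weight matrices, v a weight vector
```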

SLIDE 38

Rare Words and Monolingual Text

SLIDE 39

Handling Rare Words

  • Words are a difficult unit to work with, e.g. vocabularies get very large; how to handle OOV?

Sennrich et al. (2016)

  • Character-level models are possible, but expensive

Input: _the _eco tax _port i co _in _Po nt - de - Bu is … Output: _le _port ique _éco taxe _de _Pont - de - Bui s

  • Compromise solution: use thousands of “word pieces” (which may be full words but may also be parts of words)
  • Can do transliteration, model sub-word regularities, etc.
SLIDE 40

Byte Pair Encoding (BPE)

  • Use a large corpus of text for counting
  • Start with every individual byte (basically character) as its own symbol
  • Count bigram character co-occurrences
  • Merge the most frequent pair of adjacent characters
  • 8k merges => vocabulary of around 8000 word pieces, including many whole words
  • Most SOTA NMT systems use this on both source + target

Sennrich et al. (2016)
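A toy sketch of the BPE merge loop in the spirit of Sennrich et al. (2016); it works on a small in-memory word list, skips the end-of-word marker used in the paper, and the example corpus is made up:

```python
from collections import Counter

def learn_bpe_merges(corpus_words, num_merges):
    vocab = Counter(tuple(word) for word in corpus_words)  # every character is its own symbol
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                                  # bigram symbol co-occurrence counts
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():                # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# e.g. learn_bpe_merges(["low", "lower", "newest", "widest"] * 100, num_merges=10)
```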

SLIDE 41

Backtranslation

  • Classical MT methods used a bilingual corpus of sentences B = (S, T) and a large monolingual corpus T’ to train a language model. Can neural MT do the same? Sennrich et al. (2015)
  • Approach 1: force the system to generate T’ as targets from null inputs:
    s1, t1; s2, t2; …; [null], t’1; [null], t’2; …
  • Approach 2: generate synthetic sources with a T->S machine translation system (backtranslation):
    s1, t1; s2, t2; …; MT(t’1), t’1; MT(t’2), t’2; …
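A sketch of Approach 2 as a data-augmentation step; reverse_mt stands in for a trained target-to-source MT system, and all names here are hypothetical:

```python
def add_backtranslated_pairs(bilingual_pairs, monolingual_targets, reverse_mt):
    # reverse_mt: a trained target->source MT system (hypothetical callable).
    synthetic = [(reverse_mt(t_prime), t_prime) for t_prime in monolingual_targets]
    # Train the source->target NMT system on real pairs plus noisy synthetic-source pairs.
    return bilingual_pairs + synthetic
```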

SLIDE 42

Backtranslation

  • parallelsynth: backtranslate the training data; this makes additional noisy source sentences which could be useful
  • Gigaword: large monolingual English corpus

Sennrich et al. (2015)

SLIDE 43

Google’s NMT System

  • 8-layer LSTM encoder-decoder with attention, word-piece vocabulary of 8k-32k

Wu et al. (2016)

SLIDE 44

(Wu et al., 2016)

SLIDE 45

Google’s NMT System

Gender is correct in GNMT but not in PBMT (“sled”, “walker”)

Wu et al. (2016)

SLIDE 46

Transformers for MT

SLIDE 47

RNNs vs Transformers

[Diagram: Transformer layers 1-3 stacked over the input sentence “the movie was terribly exciting !”]

SLIDE 48

New Twist: Self-Attention

Vaswani et al. (2017)

the movie was great

  • Each word computes attention over every other word
  • Multiple “heads”: use parameters $W_k$ and $V_k$ to get different attention values + transform vectors

$\alpha_{i,j} = \mathrm{softmax}(x_i^{\top} x_j)$   (scalar)

$x'_i = \sum_{j=1}^{n} \alpha_{i,j} x_j$   (vector = sum of scalar * vector)

$\alpha_{k,i,j} = \mathrm{softmax}(x_i^{\top} W_k x_j)$

$x'_{k,i} = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j$
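These equations translate directly into a few lines; here is a sketch of one head (no separate query/key/value projections or scaling, which the full Transformer adds):

```python
import torch

def self_attention_head(X, W_k, V_k):
    """X: (n, d) word vectors x_1..x_n; W_k, V_k: (d, d) parameters for head k."""
    scores = X @ W_k @ X.T                  # x_i^T W_k x_j for every pair (i, j)
    alpha = torch.softmax(scores, dim=-1)   # each word attends over every other word
    return alpha @ (X @ V_k.T)              # row i is sum_j alpha_{k,i,j} V_k x_j
```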

SLIDE 49

Transformers

SLIDE 50

Transformers

  • NIPS’17: Attention is All You Need
  • Key idea: Multi-head self-attention
  • No recurrence structure any more, so it trains much faster
  • Originally proposed for NMT (encoder-decoder framework)
  • Used as the base model for lots of follow-up work

SLIDE 51

Transformers

  • Each Transformer block has two sub-layers
  • Multi-head attention
  • 2-layer feedforward NN (with ReLU)
  • Each sub-layer has a residual connection and a layer normalization: LayerNorm(x + SubLayer(x))

(Ba et al, 2016): Layer Normalization

  • Input layer has a positional encoding
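A sketch of one such block using PyTorch's built-in pieces; the hyperparameters follow the common base configuration and are illustrative, and this is the post-norm layout shown on the slide rather than a faithful reimplementation of the full model:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))      # 2-layer feedforward NN
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Each sub-layer is wrapped as LayerNorm(x + SubLayer(x)).
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ff(x))
```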
SLIDE 52

Transformers and Word Order

  • Augment the word embedding with position embeddings emb(1), emb(2), emb(3), …, where each dimension is a sine/cosine wave of a different frequency. Closer points = higher dot products
  • Works essentially as well as just encoding position as a one-hot vector

Vaswani et al. (2017)
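A short sketch of these sinusoidal position embeddings (assumes an even d_model; follows the Vaswani et al. (2017) formula):

```python
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)              # positions 0..max_len-1
    freq = torch.pow(10000.0, -torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # even dims: sine waves of different frequencies
    pe[:, 1::2] = torch.cos(pos * freq)   # odd dims: cosine waves
    return pe                             # added to the word embeddings emb(1), emb(2), ...
```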

SLIDE 53

Transformers for MT

  • Encoder and decoder are both transformers
  • Decoder consumes the previously generated token (and attends to the input), but has no recurrent state

Vaswani et al. (2017)

SLIDE 54

Transformers

  • Big = 6 layers, 1000 dim for each token, 16 heads; base = 6 layers + other params halved

Vaswani et al. (2017)

SLIDE 55

Visualization

Vaswani et al. (2017)

SLIDE 56

Visualization

Vaswani et al. (2017)

SLIDE 57

Useful Resources

nn.Transformer:

nn.TransformerEncoder:

The Annotated Transformer:

http://nlp.seas.harvard.edu/2018/04/03/attention.html

A Jupyter notebook which explains how Transformer works line by line in PyTorch!
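For orientation, a minimal usage sketch of nn.Transformer as an encoder-decoder MT model; vocabulary sizes and dimensions are made up, and positional encodings are omitted for brevity (see the Annotated Transformer for a complete version):

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, d_model = 8000, 8000, 512     # illustrative sizes (e.g. BPE vocabularies)

src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
generator = nn.Linear(d_model, tgt_vocab)

src = torch.randint(0, src_vocab, (2, 10))          # (batch, src_len) source token ids
tgt = torch.randint(0, tgt_vocab, (2, 7))           # (batch, tgt_len) target token ids
# Causal mask so each target position only attends to earlier positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))
out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
logits = generator(out)                              # (batch, tgt_len, tgt_vocab)
```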

SLIDE 58

So is Machine Translation solved?

  • Nope!
  • Many difficulties remain:
  • Out-of-vocabulary words
  • Domain mismatch between train and test data
  • Maintaining context over longer text
  • Low-resource language pairs


SLIDE 59

So is Machine Translation solved?

  • Nope!
  • Using common sense is still hard


SLIDE 60

So is Machine Translation solved?

  • Nope!
  • NMT picks up biases in training data


Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c

Didn’t specify gender

SLIDE 61

So is Machine Translation solved?

  • Nope!
  • Uninterpretable systems do strange things


Source: http://languagelog.ldc.upenn.edu/nll/?p=35120#more-35120

SLIDE 62

Massively multilingual MT

(Arivazhagan et al., 2019)

  • Train a single neural network on 103 languages paired with English (remember Interlingua?)
  • Massive improvements on low-resource languages