
  1. CSEP 517 Natural Language Processing: Neural Machine Translation. Luke Zettlemoyer (slides adapted from Karthik Narasimhan, Greg Durrett, Chris Manning, Dan Jurafsky)

  2. Last time • Statistical MT • Word-based • Phrase-based • Syntactic

  3. NMT: the biggest success story of NLP Deep Learning. Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016. • 2014: first seq2seq paper published • 2016: Google Translate switches from SMT to NMT • This is amazing! • SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a handful of engineers in a few months

  4. Neural Machine Translation ‣ A single neural network is used to translate from source to target ‣ Architecture: Encoder-Decoder ‣ Two main components: ‣ Encoder: Convert source sentence (input) into a vector/matrix ‣ Decoder: Convert encoding into a sentence in target language (output)

  5. Recall: RNNs $h_t = g(W h_{t-1} + U x_t + b) \in \mathbb{R}^d$
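A minimal NumPy sketch of this recurrence (the dimensions and the choice of tanh for g are illustrative assumptions, not from the slide):

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One RNN step: h_t = g(W h_{t-1} + U x_t + b), here with g = tanh."""
    return np.tanh(W @ h_prev + U @ x_t + b)

# Toy dimensions: hidden size d = 4, input size m = 3 (assumed for illustration).
d, m = 4, 3
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, m)), np.zeros(d)

h = np.zeros(d)                       # initial hidden state h_0
for x_t in rng.normal(size=(5, m)):   # a toy input sequence of length 5
    h = rnn_step(h, x_t, W, U, b)     # h stays in R^d at every step
```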

  6. Sequence to Sequence learning (Seq2seq) • Encode entire input sequence into a single vector (using an RNN) • Decode one word at a time (again, using an RNN!) • Beam search for better inference • Learning is not trivial! (vanishing/exploding gradients) (Sutskever et al., 2014)
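A minimal sketch of the "encode the entire input into a single vector" step, reusing the plain RNN recurrence from the previous slide (real systems use LSTMs/GRUs and learned word embeddings; all names here are illustrative):

```python
import numpy as np

def encode(source_vectors, W, U, b):
    """Compress the whole (embedded) source sentence into one vector h_enc,
    which is then used as the decoder RNN's initial hidden state."""
    h = np.zeros(W.shape[0])
    for x_t in source_vectors:               # one embedded source word per step
        h = np.tanh(W @ h + U @ x_t + b)     # same recurrence as the previous slide
    return h
```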

  7. Neural Machine Translation (NMT) (diagram) The encoder RNN reads the source sentence 'les pauvres sont démunis' (input) and produces an encoding of it, which provides the initial hidden state for the decoder RNN. The decoder RNN is a language model that generates the target sentence 'the poor don't have any money <END>' (output) conditioned on that encoding, taking the argmax over the vocabulary at each step. Note: this diagram shows test-time behavior; the decoder's output at each step is fed in as the next step's input.

  8. Seq2seq training ‣ Similar to training a language model! ‣ Minimize cross-entropy loss: $\sum_{t=1}^{T} -\log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$ ‣ Back-propagate gradients through both decoder and encoder ‣ Need a really big corpus (e.g. 36M sentence pairs). Russian: Машинный перевод - это круто! English: Machine translation is cool!
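A sketch of this per-sentence loss, assuming (hypothetically) that the decoder has already produced an array probs, where probs[t] is its predicted distribution over the target vocabulary at step t:

```python
import numpy as np

def seq2seq_loss(probs, target_ids):
    """Cross-entropy: sum over t of -log P(y_t | y_1..y_{t-1}, x_1..x_n),
    where target_ids[t] is the index of the correct word y_t."""
    return -sum(np.log(probs[t, y]) for t, y in enumerate(target_ids))
```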

  9. Training a Neural Machine Translation system (diagram) The encoder RNN reads the source sentence 'les pauvres sont démunis' (from the corpus) and the decoder RNN predicts the target sentence 'the poor don't have any money' (from the corpus), producing a distribution $\hat{y}_t$ over the vocabulary at each step. The loss at step $t$, $J_t$, is the negative log probability of the correct next word (e.g. 'the', 'have', <END>), and the total loss is the average $J = \frac{1}{T}\sum_{t=1}^{T} J_t$, here $\frac{1}{7}(J_1 + J_2 + \cdots + J_7)$. Seq2seq is optimized as a single system; backpropagation operates "end to end".

  10. Greedy decoding ‣ Compute argmax at every step of decoder to generate word ‣ What’s wrong?
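A sketch of greedy decoding, assuming a hypothetical decoder_step(h, y_prev) that returns the next hidden state and a NumPy probability vector over the vocabulary:

```python
def greedy_decode(h_enc, decoder_step, start_id, end_id, max_len=50):
    """Take the argmax word at every step. Cheap, but an early mistake can
    never be undone, so the overall sequence may be far from the best one."""
    h, y, output = h_enc, start_id, []
    for _ in range(max_len):
        h, probs = decoder_step(h, y)
        y = int(probs.argmax())        # greedy choice at this step
        if y == end_id:
            break
        output.append(y)
    return output
```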

  11. Exhaustive search? ‣ Find $\arg\max_{y_1, \ldots, y_T} P(y_1, \ldots, y_T \mid x_1, \ldots, x_n)$ ‣ Requires enumerating all possible output sequences ‣ $O(V^T)$ complexity! ‣ Too expensive

  12. A middle ground: Beam search ‣ Key idea: At every step, keep track of the k most probable partial translations (hypotheses) ‣ Score of each hypothesis = its log probability: $\sum_{t=1}^{j} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$ ‣ Not guaranteed to be optimal ‣ More efficient than exhaustive search

  13.–15. Beam decoding: worked example, built up step by step over three slides (slide credit: Abigail See)

  16. Beam decoding ‣ Different hypotheses may produce the end token ⟨e⟩ at different time steps ‣ When a hypothesis produces ⟨e⟩, stop expanding it and place it aside ‣ Continue beam search until: ‣ all k hypotheses produce ⟨e⟩, OR ‣ we hit the max decoding limit T ‣ Select the top hypothesis using the normalized log-likelihood score $\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$ ‣ Otherwise shorter hypotheses would have higher scores
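Putting slides 12–16 together, a compact beam-search sketch under the same assumed decoder_step interface as above (hypothetical; real implementations batch the hypotheses):

```python
import numpy as np

def beam_search(h_enc, decoder_step, start_id, end_id, k=5, max_len=50):
    """Keep the k most probable partial translations at every step."""
    beams = [(0.0, [start_id], h_enc)]         # (log-prob, tokens, hidden state)
    finished = []                              # hypotheses that produced <e>
    for _ in range(max_len):
        candidates = []
        for score, tokens, h in beams:
            h_new, probs = decoder_step(h, tokens[-1])
            for y in np.argsort(probs)[-k:]:   # k best next words for this hypothesis
                cand = (score + float(np.log(probs[y])), tokens + [int(y)], h_new)
                (finished if y == end_id else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if not beams:                          # every hypothesis has produced <e>
            break
    finished.extend(beams)                     # hypotheses cut off at max_len
    # Select by length-normalized log-likelihood, as on this slide.
    return max(finished, key=lambda c: c[0] / (len(c[1]) - 1))[1]
```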

  17. NMT vs SMT ‣ Pros: ‣ Better performance: fluency, longer context ‣ Single NN optimized end-to-end ‣ Less engineering ‣ Works out of the box for many language pairs ‣ Cons: ‣ Requires more data and compute ‣ More parameters ‣ Less interpretable ‣ Hard to debug ‣ Uncontrollable ‣ Heavily dependent on data, which can lead to unwanted biases

  18. How seq2seq changed the MT landscape

  19. MT Progress (source: Rico Sennrich)

  20. Versatile seq2seq ‣ Seq2seq finds applications in many other tasks! ‣ Any task where inputs and outputs are sequences of words/characters ‣ Summarization (input text → summary) ‣ Dialogue (previous utterance → reply) ‣ Parsing (sentence → parse tree in sequence form) ‣ Question answering (context + question → answer)

  21. Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector $h^{enc}$ needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  22. Remember alignments?

  23. Attention ‣ The neural MT equivalent of alignment models ‣ Key idea: At each time step during decoding, focus on a particular part of the source sentence ‣ This depends on the decoder's current hidden state (i.e. a notion of what you are trying to decode) ‣ Usually implemented as a probability distribution over the hidden states of the encoder ($h^{enc}_i$)

  24.–27. Sequence-to-sequence with attention (diagram, built up over four slides): the encoder RNN runs over the source sentence 'les pauvres sont démunis' (input); at the first decoder step (<START>), an attention score is computed for each encoder hidden state as its dot product with the decoder hidden state.

  28. Sequence-to-sequence with attention (diagram): take a softmax to turn the attention scores into a probability distribution, the attention distribution. On this decoder timestep, we're mostly focusing on the first encoder hidden state ('les').

  29. Sequence-to-sequence with attention (diagram): use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.

  30. Sequence-to-sequence with attention (diagram): concatenate the attention output with the decoder hidden state, then use it to compute $\hat{y}_1$ (here 'the') as before.

  31.–35. Sequence-to-sequence with attention (diagram, one slide per step): the same procedure repeats at each subsequent decoder timestep, computing attention scores, the attention distribution, the attention output, and $\hat{y}_2, \ldots, \hat{y}_6$, producing 'poor', 'don't', 'have', 'any', 'money' in turn.

  36. Computing attention ‣ Encoder hidden states: $h^{enc}_1, \ldots, h^{enc}_n$ ‣ Decoder hidden state at time $t$: $h^{dec}_t$ ‣ First, get attention scores for this time step (we will see what $g$ is soon!): $e_t = [g(h^{enc}_1, h^{dec}_t), \ldots, g(h^{enc}_n, h^{dec}_t)]$ ‣ Obtain the attention distribution using softmax: $\alpha_t = \mathrm{softmax}(e_t) \in \mathbb{R}^n$ ‣ Compute the weighted sum of encoder hidden states: $a_t = \sum_{i=1}^{n} \alpha_{t,i}\, h^{enc}_i \in \mathbb{R}^{h}$ ‣ Finally, concatenate with the decoder state and pass to the output layer: $[a_t; h^{dec}_t] \in \mathbb{R}^{2h}$
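A NumPy sketch of these four steps, taking g to be a plain dot product (one common choice; the slide leaves g unspecified):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention(h_enc, h_dec_t):
    """h_enc: (n, h) matrix of encoder hidden states; h_dec_t: (h,) decoder state."""
    e_t = h_enc @ h_dec_t                   # scores e_t, with g = dot product
    alpha_t = softmax(e_t)                  # attention distribution over source positions
    a_t = alpha_t @ h_enc                   # weighted sum of encoder states, shape (h,)
    return np.concatenate([a_t, h_dec_t])   # [a_t; h_t^dec] in R^{2h}, fed to the output layer
```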
