Neural Machine Translation

  1. SFU NatLangLab CMPT 825: Natural Language Processing Neural Machine Translation Spring 2020 2020-03-12 Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)

  2. Course Logistics ‣ Project proposal is due today ‣ What problem are you addressing? Why is it interesting? ‣ What specific aspects will your project be on? ‣ Re-implement a paper? Compare different methods? ‣ What data do you plan to use? ‣ What is your method? ‣ How do you plan to evaluate? What metrics?

  3. Last time • Statistical MT • Word-based • Phrase-based • Syntactic

  4. Neural Machine Translation ‣ A single neural network is used to translate from source to target ‣ Architecture: Encoder-Decoder ‣ Two main components: ‣ Encoder: Convert source sentence (input) into a vector/matrix ‣ Decoder: Convert encoding into a sentence in the target language (output)

  5. Sequence to Sequence learning (Seq2seq) • Encode entire input sequence into a single vector (using an RNN) • Decode one word at a time (again, using an RNN!) • Beam search for better inference • Learning is not trivial! (vanishing/exploding gradients) (Sutskever et al., 2014)

  6.–9. Encoder. Sentence: "This cat is cute". [Figure, built up over four slides: each word is mapped to a word embedding x_1, …, x_4; an RNN reads the embeddings left to right, producing hidden states h_1, …, h_4 from an initial state h_0; the final hidden state serves as the encoded representation h_enc of the source sentence.]
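
A minimal PyTorch sketch of such an RNN encoder (class and variable names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embeds source tokens and runs them through a GRU; the final
    hidden state serves as the encoded representation h_enc."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                  # src_ids: (batch, src_len)
        embedded = self.embedding(src_ids)       # (batch, src_len, embed_dim)
        outputs, h_enc = self.rnn(embedded)      # outputs: all hidden states h_1..h_n
        return outputs, h_enc                    # h_enc: (1, batch, hidden_dim)
```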

  10.–13. Decoder: a conditioned language model. Target sentence: "ce chat est mignon". [Figure, built up over four slides: the decoder RNN is initialized with h_enc; at each step it takes the embedding of the previous target word (<s>, ce, chat, est, mignon) and its hidden state z_1, …, z_5, and outputs the next word (ce, chat, est, mignon, <e>).]
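
A matching decoder sketch, again with illustrative names: it is initialized with h_enc and predicts the next target word given the previous one.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Conditioned language model: starts from h_enc and predicts
    the next target word given the previous one."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, h_enc):           # tgt_ids: (batch, tgt_len), shifted right (<s> ce chat ...)
        embedded = self.embedding(tgt_ids)
        states, _ = self.rnn(embedded, h_enc)    # condition every step on the source encoding
        return self.out(states)                  # logits over the target vocabulary at every step
```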

  14. Seq2seq training ‣ Similar to training a language model! ‣ Minimize cross-entropy loss: $\sum_{t=1}^{T} -\log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$ ‣ Back-propagate gradients through both decoder and encoder ‣ Need a really big corpus (e.g. 36M sentence pairs) Russian: Машинный перевод - это круто! English: Machine translation is cool!
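
Continuing the sketches above, a hypothetical training step that minimizes this loss and backpropagates through both decoder and encoder (the optimizer choice and layer sizes are assumptions):

```python
import torch
import torch.nn.functional as F

encoder, decoder = Encoder(10000, 256, 512), Decoder(10000, 256, 512)   # assumed sizes
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

def train_step(src_ids, tgt_in, tgt_out):
    # tgt_in is the target shifted right (<s> y_1 ... y_{T-1}); tgt_out is (y_1 ... y_T)
    _, h_enc = encoder(src_ids)
    logits = decoder(tgt_in, h_enc)                          # (batch, T, |V|)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()                                          # gradients flow through decoder and encoder
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```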

  15. Seq2seq training (slide credit: Abigail See)

  16. Remember masking: use masking to help compute the loss for batched sequences. [Figure: an example binary mask for a batch of four padded sequences of length six, e.g. 1 1 1 1 0 0 / 1 0 0 0 0 0 / 1 1 1 1 1 1 / 1 1 1 0 0 0, where 1 marks a real token and 0 marks padding.]
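
A sketch of how such a 0/1 mask might be built from the true sequence lengths and used to ignore padded positions in the loss (function and argument names are made up for illustration):

```python
import torch
import torch.nn.functional as F

def masked_loss(logits, tgt_out, lengths):
    # logits: (batch, T, |V|); tgt_out: (batch, T); lengths: (batch,) true sequence lengths
    T = tgt_out.size(1)
    mask = (torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)).float()   # (batch, T) of 1s and 0s
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                tgt_out.reshape(-1), reduction='none').reshape_as(mask)
    return (per_token * mask).sum() / mask.sum()             # average over real tokens only
```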

  17. Scheduled Sampling Possible decay schedules (the probability of using the true y decays over time) (figure credit: Bengio et al., 2015)
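
A sketch of the idea for a GRU decoder like the one above: at each step, feed the gold previous word with probability teacher_prob (decayed over training according to one of the schedules in the figure) and the model's own prediction otherwise. Names and shapes are assumptions.

```python
import random
import torch

def decode_with_scheduled_sampling(decoder, tgt_in, h_enc, teacher_prob):
    # teacher_prob: probability of feeding the true previous word; decayed over training
    batch, T = tgt_in.shape
    inputs, state, all_logits = tgt_in[:, :1], h_enc, []     # start with <s>
    for t in range(T):
        out, state = decoder.rnn(decoder.embedding(inputs), state)
        logits = decoder.out(out)                            # (batch, 1, |V|)
        all_logits.append(logits)
        if t + 1 < T:
            if random.random() < teacher_prob:
                inputs = tgt_in[:, t + 1 : t + 2]            # gold previous word
            else:
                inputs = logits.argmax(dim=-1)               # model's own prediction
    return torch.cat(all_logits, dim=1)                      # (batch, T, |V|)
```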

  18. How seq2seq changed the MT landscape

  19. MT Progress (source: Rico Sennrich)

  20. (Wu et al., 2016)

  21. NMT vs SMT
  Pros: ‣ Better performance (fluency, longer context) ‣ Single NN optimized end-to-end ‣ Less engineering ‣ Works out of the box for many language pairs
  Cons: ‣ Requires more data and compute ‣ Less interpretable ‣ Hard to debug ‣ Uncontrollable ‣ Heavily dependent on data - could lead to unwanted biases ‣ More parameters

  22. Seq2Seq for more than NMT (Task/Application: Input → Output)
  ‣ Machine Translation: French → English
  ‣ Summarization: Document → Short Summary
  ‣ Dialogue: Utterance → Response
  ‣ Parsing: Sentence → Parse tree (as sequence)
  ‣ Question Answering: Context + Question → Answer

  23. Cross-Modal Seq2Seq (Task/Application: Input → Output)
  ‣ Speech Recognition: Speech Signal → Transcript
  ‣ Image Captioning: Image → Text
  ‣ Video Captioning: Video → Text

  24. Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector, h_enc, needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  25. Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector, h_enc, needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  26. Remember alignments?

  27. Attention ‣ The neural MT equivalent of alignment models ‣ Key idea: At each time step during decoding, focus on a particular part of the source sentence ‣ This depends on the decoder’s current hidden state (i.e. notion of what you are trying to decode) ‣ Usually implemented as a probability distribution over the hidden states of the encoder ($h_i^{enc}$)

  28.–32. Seq2seq with attention [figure, built up over several slides; the predicted word ŷ_1 can also be used as input for the next time step] (slide credit: Abigail See)

  33. Computing attention
  ‣ Encoder hidden states: $h_1^{enc}, \ldots, h_n^{enc}$
  ‣ Decoder hidden state at time $t$: $h_t^{dec}$
  ‣ First, get attention scores for this time step (we will see what $g$ is soon!): $e^t = [g(h_1^{enc}, h_t^{dec}), \ldots, g(h_n^{enc}, h_t^{dec})]$
  ‣ Obtain the attention distribution using softmax: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^n$
  ‣ Compute weighted sum of encoder hidden states: $a_t = \sum_{i=1}^{n} \alpha_i^t h_i^{enc} \in \mathbb{R}^h$
  ‣ Finally, concatenate with decoder state and pass on to output layer: $[a_t; h_t^{dec}] \in \mathbb{R}^{2h}$
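
A sketch of this computation with a dot-product score as $g$ (shapes and names are assumptions):

```python
import torch

def attention_step(enc_outputs, dec_state):
    # enc_outputs: (batch, n, h) encoder hidden states h_1..h_n
    # dec_state:   (batch, h)    decoder hidden state at time t
    scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)   # e^t: (batch, n), dot-product g
    alpha = torch.softmax(scores, dim=-1)                                # attention distribution over source
    a_t = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)          # weighted sum: (batch, h)
    return torch.cat([a_t, dec_state], dim=-1)                           # [a_t; h_t^dec]: (batch, 2h)
```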

  34. Types of attention
  ‣ Assume encoder hidden states $h_1, h_2, \ldots, h_n$ and decoder hidden state $z$
  1. Dot-product attention (assumes equal dimensions for $h_i$ and $z$): $e_i = g(h_i, z) = z^T h_i \in \mathbb{R}$
  2. Multiplicative attention: $g(h_i, z) = z^T W h_i \in \mathbb{R}$, where $W$ is a weight matrix
  3. Additive attention: $g(h_i, z) = v^T \tanh(W_1 h_i + W_2 z) \in \mathbb{R}$, where $W_1, W_2$ are weight matrices and $v$ is a weight vector
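
The three scoring functions, sketched as small PyTorch functions/modules (weight shapes are assumptions; h is (n, d_h), z is (d_z,)):

```python
import torch
import torch.nn as nn

def dot_score(h, z):
    return h @ z                                       # e_i = z^T h_i, requires d_h == d_z

class MultiplicativeScore(nn.Module):
    def __init__(self, d_h, d_z):
        super().__init__()
        self.W = nn.Linear(d_h, d_z, bias=False)       # learned weight matrix W
    def forward(self, h, z):
        return self.W(h) @ z                           # e_i = z^T W h_i

class AdditiveScore(nn.Module):
    def __init__(self, d_h, d_z, d_a):
        super().__init__()
        self.W1 = nn.Linear(d_h, d_a, bias=False)
        self.W2 = nn.Linear(d_z, d_a, bias=False)
        self.v = nn.Linear(d_a, 1, bias=False)         # learned weight vector v
    def forward(self, h, z):
        return self.v(torch.tanh(self.W1(h) + self.W2(z))).squeeze(-1)   # e_i = v^T tanh(W1 h_i + W2 z)
```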

  35. Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector, h_enc, needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting

  36. Dropout ‣ Form of regularization for RNNs (and any NN in general) ‣ Idea: “Handicap” NN by removing hidden units stochastically ‣ set each hidden unit in a layer to 0 with probability $p$ during training ($p = 0.5$ usually works well) ‣ scale outputs by $1/(1-p)$ ‣ hidden units forced to learn more general patterns ‣ Test time: Use all activations (no need to rescale)
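
A sketch of this "inverted dropout" recipe (torch.nn.Dropout implements the same behavior):

```python
import torch

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training
    and scale the survivors by 1/(1-p); at test time, return x unchanged."""
    if not training:
        return x
    keep = (torch.rand_like(x) > p).float()
    return x * keep / (1.0 - p)
```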

  37. Handling large vocabularies ‣ Softmax can be expensive for large vocabularies: $P(y_i) = \frac{\exp(w_i \cdot h + b_i)}{\sum_{j=1}^{|V|} \exp(w_j \cdot h + b_j)}$, where the normalizer over all $|V|$ words is expensive to compute ‣ English vocabulary size: 10K to 100K

  38. Approximate Softmax ‣ Negative Sampling ‣ Structured softmax ‣ Embedding prediction

  39. Negative Sampling • Softmax is expensive when vocabulary size is large (figure credit: Graham Neubig)

  40. Negative Sampling • Sample just a subset of the vocabulary as negatives • Saw simple negative sampling in word2vec (Mikolov 2013) • Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih & Teh 2012) (figure credit: Graham Neubig)
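
A much-simplified sketch of the idea: score the true word against a handful of randomly drawn negatives instead of the full vocabulary. Real estimators (NCE, importance sampling) correct for the sampling distribution; the uniform sampler and names here are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(h, W, b, target, num_neg=100):
    # h: (hidden,) decoder state; W: (|V|, hidden), b: (|V|,) output layer; target: scalar true word id
    V = W.size(0)
    neg = torch.randint(0, V, (num_neg,))            # sampled negative word ids (may occasionally hit the target)
    ids = torch.cat([target.view(1), neg])           # true word first, then negatives
    logits = W[ids] @ h + b[ids]                     # only num_neg + 1 dot products instead of |V|
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))   # true word is class 0
```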

  41. Hierarchical softmax (Morin and Bengio 2005) (figure credit: Quora)

  42. Class based softmax ‣ Two-layer: cluster words into classes, predict the class and then predict the word (figure credit: Graham Neubig) ‣ Clusters can be based on frequency, random assignment, or word contexts (Goodman 2001, Mikolov et al. 2011)
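
A sketch of the two-layer factorization P(word | h) = P(class | h) · P(word | class, h), assuming equal-sized classes (the class assignment and sizes are illustrative):

```python
import torch
import torch.nn as nn

class ClassFactoredSoftmax(nn.Module):
    """Two small softmaxes (over classes, then over words inside the given class)
    instead of one softmax over the whole vocabulary."""
    def __init__(self, hidden_dim, num_classes, words_per_class):
        super().__init__()
        self.class_layer = nn.Linear(hidden_dim, num_classes)
        # one output matrix per class; only the relevant block is used per word
        self.word_weights = nn.Parameter(torch.randn(num_classes, words_per_class, hidden_dim) * 0.01)

    def log_prob(self, h, word_class, word_in_class):      # h: (hidden_dim,)
        log_p_class = torch.log_softmax(self.class_layer(h), dim=-1)[word_class]
        word_logits = self.word_weights[word_class] @ h    # (words_per_class,) logits only
        log_p_word = torch.log_softmax(word_logits, dim=-1)[word_in_class]
        return log_p_class + log_p_word
```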

  43. Embedding prediction ‣ Directly predict embeddings of the outputs themselves (Kumar and Tsvetkov 2019) ‣ What loss to use? ‣ L2? Cosine? ‣ von Mises-Fisher distribution loss, which makes embeddings close on the unit ball (slide credit: Graham Neubig)
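
For illustration only, the cosine option from the list above as a loss over predicted and gold output embeddings (not the vMF loss the paper proposes):

```python
import torch.nn.functional as F

def cosine_loss(pred_emb, target_emb):
    # 1 - cos(pred, target), averaged over the batch; both inputs: (batch, embed_dim)
    return 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
```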

  44. Generation How can we use our model (decoder) to generate sentences? • Sampling: Try to generate a random sentence according to the probability distribution • Argmax: Try to generate the best sentence, i.e. the sentence with the highest probability

  45. Decoding Strategies ‣ Ancestral sampling ‣ Greedy decoding ‣ Exhaustive search ‣ Beam search

  46. Ancestral Sampling • Randomly sample words one by one • Provides diverse output (high variance) (figure credit: Luong, Cho, and Manning)
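
A sketch using the decoder from earlier: draw each word at random from P(y_t | y_&lt;t, x) until the end symbol (the token ids and decoder attributes are assumptions).

```python
import torch

def sample_sentence(decoder, h_enc, bos_id, eos_id, max_len=50):
    """Ancestral sampling: draw each word from the model's distribution."""
    inputs = torch.full((1, 1), bos_id, dtype=torch.long)   # start with <s>; h_enc: (1, 1, hidden)
    state, words = h_enc, []
    for _ in range(max_len):
        out, state = decoder.rnn(decoder.embedding(inputs), state)
        probs = torch.softmax(decoder.out(out)[0, -1], dim=-1)
        next_id = torch.multinomial(probs, 1)               # random draw from the distribution
        if next_id.item() == eos_id:
            break
        words.append(next_id.item())
        inputs = next_id.view(1, 1)
    return words
```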

  47. Greedy decoding ‣ Compute the argmax at every step of the decoder to generate the next word ‣ What’s wrong?
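
The same loop as above with the random draw replaced by an argmax (again a sketch with assumed names):

```python
import torch

def greedy_decode(decoder, h_enc, bos_id, eos_id, max_len=50):
    """Greedy decoding: pick the single most probable word at every step."""
    inputs = torch.full((1, 1), bos_id, dtype=torch.long)
    state, words = h_enc, []
    for _ in range(max_len):
        out, state = decoder.rnn(decoder.embedding(inputs), state)
        next_id = decoder.out(out)[0, -1].argmax()           # argmax instead of sampling
        if next_id.item() == eos_id:
            break
        words.append(next_id.item())
        inputs = next_id.view(1, 1)
    return words
```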
