Attentive Sequence-to-Sequence Learning


  1. NPFL116 Compendium of Neural Machine Translation Attentive Sequence-to-Sequence Learning March 6, 2018 Jindřich Helcl, Jindřich Libovický Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics

  2. Neural Network Language Models

  3. RNN Language Model
  • An RNN LM defines a distribution over sentences
  • Train the RNN as a classifier for the next word (unlimited history)
  • Can be used to estimate sentence probability / perplexity
  • We can sample from the distribution
  [Figure: RNN unrolled over inputs <s>, w_1 … w_4, emitting p(w_1) … p(w_5); when sampling, each drawn word ~w_i is fed back as the next input.]
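  A minimal sketch of what sampling from an RNN LM looks like, in the same numpy-style pseudocode as the decoder code on slide 8; rnn_cell, output_projection, embeddings and vocab are assumed helpers and inputs, not part of the slides:

  import numpy as np

  def softmax(x):
      e = np.exp(x - np.max(x))
      return e / e.sum()

  def sample_sentence(rnn_cell, output_projection, embeddings, vocab, state_size, max_len=50):
      # start with a zero state and the start-of-sentence token
      state = np.zeros(state_size)
      last_w = "<s>"
      sentence = []
      for _ in range(max_len):
          state, output = rnn_cell(state, embeddings[last_w])
          p = softmax(output_projection(output))   # p(w_i | w_{i-1}, ..., w_1)
          last_w = np.random.choice(vocab, p=p)    # draw ~w_i from the distribution
          if last_w == "</s>":
              break
          sentence.append(last_w)
      return sentence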

  4. Two Views on the RNN LM
  • RNN is a for loop (functional map) over sequential data
  • All outputs are conditional distributions → a probabilistic distribution over sequences of words:
  P(w_1, …, w_n) = ∏_{i=1}^{n} P(w_i | w_{i−1}, …, w_1)
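  The same loop can also score a given sentence: the chain rule above just sums per-word log-probabilities. A sketch under the same assumptions as the previous block (rnn_cell, output_projection, embeddings, plus a vocab_index lookup, all illustrative; softmax reused from above):

  def sentence_log_prob(rnn_cell, output_projection, embeddings, vocab_index, words, state_size):
      # log P(w_1, ..., w_n) = sum_i log P(w_i | w_{i-1}, ..., w_1)
      state = np.zeros(state_size)
      last_w = "<s>"
      log_prob = 0.0
      for w in list(words) + ["</s>"]:
          state, output = rnn_cell(state, embeddings[last_w])
          p = softmax(output_projection(output))
          log_prob += np.log(p[vocab_index[w]])    # log P(w | history)
          last_w = w
      return log_prob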

  5. Vanilla Sequence-to-Sequence Model

  6. Encoder-Decoder NMT
  • Exploits the conditional LM scheme
  • Two networks:
    1. A network processing the input sentence into a single vector representation (encoder)
    2. A neural language model initialized with the output of the encoder (decoder)
  Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.

  7. Encoder-Decoder – Image: source-language input + target-language LM
  [Figure: encoder reads x_1 … x_4, <s>; its final state initializes the decoder, which generates ~y_1 … ~y_5.]

  8. Encoder-Decoder Model – Code

  # encoder: fold the source sentence into a single state vector
  state = np.zeros(emb_size)
  for w in input_words:
      input_embedding = source_embeddings[w]
      state, _ = enc_cell(state, input_embedding)

  # decoder: greedy decoding, initialized with the encoder state
  last_w = "<s>"
  while last_w != "</s>":
      last_w_embedding = target_embeddings[last_w]
      state, dec_output = dec_cell(state, last_w_embedding)
      logits = output_projection(dec_output)
      last_w = np.argmax(logits)
      yield last_w

  9. Encoder-Decoder Model – Formal Notation
  Data: x = (x_1, …, x_{T_x}) … input embeddings (source language); y = (y_1, …, y_{T_y}) … output embeddings (target language)
  Encoder: initial state h_0 ≡ 0; j-th state h_j = RNN_enc(h_{j−1}, x_j); final state h_{T_x}
  Decoder: initial state s_0 = h_{T_x}; i-th decoder state s_i = RNN_dec(s_{i−1}, ŷ_i); i-th word score (output or multi-layer projection) t_{i+1} = U_o s_i + V_o E ŷ_i + b_o; ŷ_{i+1} = arg max t_{i+1}

  10. Encoder-Decoder: Training Objective
  For output word y_i we have:
  • Estimated conditional distribution (softmax function): p̂_i = exp(t_i) / ∑ exp(t_i)
  • Unknown true distribution p_i; we lay p_i ≡ 1[y_i]
  Cross entropy ≈ distance of p̂ and p:
  L = H(p̂, p) = E_p(− log p̂) = − log p̂(y_i)
  …computing ∂L/∂t_i is super simple
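  Why ∂L/∂t_i is "super simple": with the one-hot target 1[y_i], the gradient of the cross-entropy with respect to the logits is just p̂ − 1[y_i]. A small self-contained numpy check (the function and variable names are illustrative, not from the slides):

  import numpy as np

  def softmax(t):
      e = np.exp(t - np.max(t))
      return e / e.sum()

  def cross_entropy_and_grad(t, y_index):
      # L = -log p̂(y_i);   ∂L/∂t = p̂ - 1[y_i]
      p_hat = softmax(t)
      loss = -np.log(p_hat[y_index])
      grad = p_hat.copy()
      grad[y_index] -= 1.0
      return loss, grad

  # example with made-up logits: the gradient is the predicted distribution
  # minus one at the position of the true word
  loss, grad = cross_entropy_and_grad(np.array([1.0, 2.0, 0.5]), y_index=1)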

  11. Implementation: Runtime vs. Training
  • training: the decoder is fed the ground-truth words y_j and the loss is computed against them
  • runtime: the decoder is fed the previously decoded words ŷ_j
  [Figure: the same network unrolled twice, once with ground-truth inputs y_1 … y_4 feeding the loss, once feeding the decoded words ~y_1 … ~y_5 back in.]
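  A sketch of the training-time decoder loop (commonly called teacher forcing), in the spirit of the code on slide 8: the ground-truth word is fed back instead of the decoded one and the per-word losses are summed. dec_cell, output_projection, target_embeddings, vocab_index and cross_entropy (returning −log p̂ of the true word) are assumed helpers:

  # training: feed the ground-truth word y_j, not the decoded ~y_j
  state = encoder_final_state
  last_w = "<s>"
  loss = 0.0
  for y in target_words + ["</s>"]:
      state, dec_output = dec_cell(state, target_embeddings[last_w])
      logits = output_projection(dec_output)
      loss += cross_entropy(logits, vocab_index[y])   # compare with the ground truth
      last_w = y                                      # the ground-truth word goes back in
  # at runtime, last_w = np.argmax(logits) would be fed back instead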

  12. Sutskever et al.
  • Reverse input sequence
  • Impressive empirical results – made researchers believe NMT is the way to go
  Evaluation on WMT14 EN → FR test set:
  method                           BLEU score
  vanilla SMT                      33.0
  tuned SMT                        37.0
  Sutskever et al.: reversed       30.6
  –"–: ensemble + beam search      34.8
  –"–: vanilla SMT rescoring       36.5
  Bahdanau's attention             28.5
  Why is Bahdanau's (better) model worse?

  13. Sutskever et al. × Bahdanau et al.
                       Sutskever et al.         Bahdanau et al.
  vocabulary           160k enc, 80k dec        30k both
  word embeddings      1,000 dimensions         620 dimensions
  encoder              4 × LSTM, 1,000 units    bidi GRU, 2,000 units
  decoder              4 × LSTM, 1,000 units    GRU, 1,000 units
  training time        7.5 epochs               5 epochs
  With Bahdanau's model size:
  method               BLEU score
  encoder-decoder      13.9
  attention model      28.5

  14. Attentive Sequence-to-Sequence Learning

  15. Main Idea
  • An RNN can serve as an LM — it can store the language context in its hidden states
  • Same as reversing the input: do not force the network to catch long-distance dependencies
  • Use the decoder state only for target sentence dependencies and as a query for the source sentence words

  16. Inspiration: Neural Turing Machine
  • General architecture for learning algorithmic tasks, a finite imitation of a Turing Machine
  • Needs to address memory somehow – either by position or by content
  • In fact it does not work well – it hardly manages simple algorithmic tasks
  • Content-based addressing → attention

  17. Attention Model
  [Figure: encoder states h_0 … h_4 over inputs x_1 … x_4, <s> are weighted by attention weights α_0 … α_4 and summed; the weighted sum feeds the decoder states s_{i−1}, s_i, s_{i+1}, which emit ~y_i, ~y_{i+1}.]

  18. Attention Model in Equations (1)
  Inputs: decoder state s_i, encoder states h_j = [→h_j; ←h_j], ∀ j = 1 … T_x
  Attention energies: e_ij = v_a^T tanh(W_a s_{i−1} + U_a h_j + b_a)
  Attention distribution: α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)
  Context vector: c_i = ∑_{j=1}^{T_x} α_ij h_j
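  A minimal numpy sketch of these three equations for a single decoder step; the parameter shapes and names (W_a, U_a, v_a, b_a as plain arrays, H as the matrix of encoder states) are illustrative assumptions, not from the slides:

  import numpy as np

  def attention_step(s_prev, H, W_a, U_a, v_a, b_a):
      # s_prev: previous decoder state s_{i-1}, shape (d_dec,)
      # H:      encoder states h_1 ... h_{T_x}, shape (T_x, 2*d_enc)
      # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a)
      e = np.tanh(s_prev @ W_a.T + H @ U_a.T + b_a) @ v_a   # shape (T_x,)
      # alpha_ij = exp(e_ij) / sum_k exp(e_ik)
      alpha = np.exp(e - e.max())
      alpha /= alpha.sum()
      # c_i = sum_j alpha_ij h_j
      c = alpha @ H                                          # shape (2*d_enc,)
      return c, alpha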

  19. Attention Model in Equations (2)
  Output projection: t_i = MLP(U_o s_{i−1} + V_o E y_{i−1} + C_o c_i + b_o)
  …the context vector is mixed with the hidden state
  Output distribution: p(y_i = w | s_i, y_{i−1}, c_i) ∝ exp((W_o t_i)_w + b_w)
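  Continuing the sketch above, the output projection and distribution for one step could look like this; the MLP is simplified to a single tanh layer (an assumption), and y_prev_emb stands for E y_{i−1}:

  def output_distribution(s_prev, y_prev_emb, c, U_o, V_o, C_o, b_o, W_o, b_w):
      # t_i = MLP(U_o s_{i-1} + V_o E y_{i-1} + C_o c_i + b_o), MLP simplified to tanh
      t = np.tanh(U_o @ s_prev + V_o @ y_prev_emb + C_o @ c + b_o)
      # p(y_i = w | s_i, y_{i-1}, c_i) proportional to exp((W_o t_i)_w + b_w)
      scores = W_o @ t + b_w
      scores -= scores.max()
      p = np.exp(scores)
      return p / p.sum()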

  20. Attention Visualization

  21. Attention vs. Alignment
  Differences between the attention model and the word alignment used for phrase-table generation:
  attention (NMT)      alignment (SMT)
  probabilistic        discrete
  declarative          imperative
  LM generates         LM discriminates

  22. Image Captioning: attention over a CNN used for image classification. Source: Xu, Kelvin, et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML. Vol. 14. 2015.

  23. Reading for the Next Week
  Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
  Question: The model uses scaled dot-product attention, which is a non-parametric variant of the attention mechanism. Why do you think it is sufficient in this setup? Do you think it would work in the recurrent model as well? The way the model processes the sequence is fundamentally different from RNNs or CNNs. Does it agree with your intuition of how language should be processed?
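  For contrast with the additive attention above, a sketch of the scaled dot-product attention the paper uses, softmax(Q K^T / sqrt(d_k)) V, which has no learned attention parameters such as W_a, U_a, v_a; array shapes are illustrative:

  import numpy as np

  def scaled_dot_product_attention(Q, K, V):
      # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
      d_k = K.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys)
      scores -= scores.max(axis=-1, keepdims=True)
      weights = np.exp(scores)
      weights /= weights.sum(axis=-1, keepdims=True)
      return weights @ V                              # (n_queries, d_v)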
