Transformer Ablation Studies


  1. Transformer Ablation Studies
     Simon Will, Institute of Formal and Applied Linguistics, Charles University
     Seminar: Statistical Machine Translation, Instructor: Dr. Ondřej Bojar
     May 30, 2019

  2. Structure
     ▶ The Transformer
     ▶ Ablation Concerns
     ▶ Feed Forward Layers
     ▶ Positional Embeddings
     ▶ Self-Attention Keys and Queries

  3. Idea
     ▶ Transformer successful and many variations exist (e.g. Ott et al. 2018; Dai et al. 2019)
     ▶ Difficult to know what the essentials are and what each part contributes
     → Train similar models differing in crucial points

  4. Transformer (Vaswani et al. 2017)
     ▶ Encoder-Decoder Architecture based on attention
     ▶ No recurrence
     ▶ Constant in source and target “time” while training
     ▶ In inference, only constant in source “time”
     ▶ Better parallelizable than RNN-based networks

  5. Transformer Illustration
     Figure: Two-Layer Transformer (image from Alammar 2018)

  6. Ablation Concerns
     Figure: Areas of Concern for this Project

  7. Feed Forward Layers
     ▶ Contribution of feed forward layers in encoder and decoder not clear.
     ▶ Is the attention enough?
     → Three configurations:
       ▶ No encoder FF layer
       ▶ No decoder FF layer
       ▶ No decoder and no encoder FF layer
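To make the three configurations concrete, here is a minimal PyTorch sketch of an encoder layer whose feed-forward sub-layer can be switched off. This is an illustrative reconstruction, not the presenter's code; the class name, the `use_ff` flag and the layer sizes are assumptions.

```python
# Sketch only: a Transformer encoder layer where the feed-forward (FF)
# sub-layer can be disabled, as in the "no encoder FF layer" configuration.
import torch
import torch.nn as nn

class AblatableEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ff=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.use_ff = use_ff
        if use_ff:
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)      # self-attention over the tokens
        x = self.norm1(x + attn_out)               # residual connection + layer norm
        if self.use_ff:
            x = self.norm2(x + self.ff(x))         # FF sub-layer, skipped in the ablation
        return x

layer = AblatableEncoderLayer(use_ff=False)        # "no encoder FF layer" configuration
out = layer(torch.randn(2, 7, 512))                # (batch, tokens, d_model)
```

The decoder-side ablation would drop its FF sub-layer in the same way, yielding the three configurations listed above.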

  8. Ablation Concerns
     Figure: Areas of Concern for this Project

  9. Ablation Concerns
     Figure: Areas of Concern for this Project

  10. Positional Embeddings
      ▶ No recurrence → no information about order of tokens
      → Add information via explicit positional embeddings
      ▶ Added to the word embedding vector
      ▶ Two types:
        ▶ Learned embeddings of absolute position (e.g. Gehring et al. 2017)
        ▶ Sinusoidal embeddings (used in Vaswani et al. 2017):
          PE(pos, 2i)   = sin(pos / 10000^(2i / d_key))
          PE(pos, 2i+1) = cos(pos / 10000^(2i / d_key))
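As a concrete rendering of the two formulas above, here is a short NumPy sketch. It is not taken from the slides; the function name and the default `d_key` of 512 are assumptions matching the illustration on the next slide.

```python
# Sinusoidal positional embeddings (Vaswani et al. 2017): even dimensions use
# sin, odd dimensions use cos, with wavelengths growing with the dimension index.
import numpy as np

def sinusoidal_embeddings(n_positions, d_key=512):
    pos = np.arange(n_positions)[:, None]             # shape (n_positions, 1)
    i = np.arange(d_key // 2)[None, :]                # shape (1, d_key / 2)
    angles = pos / np.power(10000.0, 2 * i / d_key)   # pos / 10000^(2i / d_key)
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)                      # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                      # PE(pos, 2i+1)
    return pe

pe = sinusoidal_embeddings(20)   # e.g. the 20-token sentence illustrated next
```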

  11. Positional Embeddings Illustrated
      Figure: Illustrated positional embeddings for a 20-token sentence and key dimensionality 512 (taken from Alammar 2018)

  12. Modifications
      ▶ Vary the “rainbow stretch” by introducing a stretching factor α:
        PE(pos, 2i) = sin(pos / (α · 10000)^(2i / d_key))
      ▶ Expectations:
        ▶ α too low: No positional information
        ▶ α too high: Word embedding information destroyed
        ▶ α other than 1 optimal
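A sketch of the stretched variant, under the assumption made in the reconstruction above that α rescales the 10000 base; the exact placement of α on the original slide may differ.

```python
# Stretched sinusoidal embeddings: alpha = 1 recovers the original formula.
# The placement of alpha is an assumption, not confirmed by the slides.
import numpy as np

def stretched_embeddings(n_positions, d_key=512, alpha=1.0):
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_key // 2)[None, :]
    angles = pos / np.power(alpha * 10000.0, 2 * i / d_key)  # assumed placement of alpha
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```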

  13. Self-Attention Keys and Queries
      ▶ Attention(Q, K, V) = softmax((Q W^Q)(K W^K)^T / √d_k) (V W^V)
      ▶ In encoder, source words are used for key and query generation with different matrices
      ▶ Modification: Use the same matrix for both
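A NumPy sketch of the attention formula above together with the proposed modification; the `shared` flag and the matrix shapes are illustrative assumptions, not the presenter's implementation.

```python
# Scaled dot-product attention with learned projections W_Q, W_K, W_V.
# With shared=True the key projection reuses W_Q, i.e. one matrix for both.
import numpy as np

def attention(Q, K, V, W_Q, W_K, W_V, shared=False):
    d_k = W_K.shape[1]
    W_K_eff = W_Q if shared else W_K                    # modification: same matrix for keys and queries
    scores = (Q @ W_Q) @ (K @ W_K_eff).T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ (V @ W_V)

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 512))                   # encoder self-attention: same source words
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) for _ in range(3))
out = attention(Q, K, V, W_Q, W_K, W_V, shared=True)    # the "same matrix" variant
```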

  14. Experiment Design
      ▶ Do all basic configurations
      ▶ Combine well-performing modifications
      ▶ How to compare?
        ▶ BLEU score on test set at best dev set performance
        ▶ Whole learning curves (similar to Popel and Bojar 2018)
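For the BLEU-based comparison, a tiny sacrebleu example; the sentence pair is a made-up placeholder, not from the experiments.

```python
# Corpus-level BLEU with sacrebleu: hypotheses is a list of system outputs,
# references a list of reference streams (a single stream here).
import sacrebleu

hypotheses = ["a young woman is holding a flower ."]        # placeholder system output
references = [["a young woman holds a flower ."]]           # placeholder reference
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```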

  15. Dataset
      ▶ Parallel image captions (Elliott et al. 2016)
      ▶ https://github.com/multi30k/dataset
      ▶ Short sentences
      ▶ Rather small (30k sentences)
      ▶ Good because fitting takes less than a day
      ▶ Bad because dev and test performance is far below train performance

  16. Conclusion
      ▶ Experiments still pending
      ▶ Expecting to see mainly negative results
      ▶ Hopefully some positive ones

  17. Nice translation by the system
      “eine junge frau hält eine blume , um sich an der blume zu halten .”
      “a young woman is holding a flower in order to hold on to the flower .”

  18. References I
      Alammar, Jay (2018). The Illustrated Transformer. URL: http://jalammar.github.io/illustrated-transformer/.
      Dai, Zihang et al. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”. In: CoRR abs/1901.02860. arXiv: 1901.02860. URL: http://arxiv.org/abs/1901.02860.
      Elliott, Desmond et al. (2016). “Multi30K: Multilingual English-German Image Descriptions”. In: Proceedings of the 5th Workshop on Vision and Language. Berlin, Germany: Association for Computational Linguistics, pp. 70–74. DOI: 10.18653/v1/W16-3210. URL: http://www.aclweb.org/anthology/W16-3210.
      Gehring, Jonas et al. (2017). “Convolutional Sequence to Sequence Learning”. In: CoRR abs/1705.03122. arXiv: 1705.03122. URL: http://arxiv.org/abs/1705.03122.
      Ott, Myle et al. (2018). “Scaling Neural Machine Translation”. In: CoRR abs/1806.00187. arXiv: 1806.00187. URL: http://arxiv.org/abs/1806.00187.

  19. References II
      Popel, Martin and Ondřej Bojar (2018). “Training Tips for the Transformer Model”. In: CoRR abs/1804.00247. arXiv: 1804.00247. URL: http://arxiv.org/abs/1804.00247.
      Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: CoRR abs/1706.03762. arXiv: 1706.03762. URL: http://arxiv.org/abs/1706.03762.
