Transformer Ablation Studies
Simon Will
Institute of Formal and Applied Linguistics, Charles University
Seminar: Statistical Machine Translation
Instructor: Dr. Ondřej Bojar
May 30, 2019
Structure
▶ The Transformer
▶ Ablation Concerns
▶ Feed Forward Layers
▶ Positional Embeddings
▶ Self-Attention Keys and Queries
Idea
▶ Transformer successful and many variations exist (e.g. Ott et al. 2018; Dai et al. 2019)
▶ Difficult to know what the essentials are and what each part contributes
→ Train similar models differing in crucial points
Transformer (Vaswani et al. 2017)
▶ Encoder-decoder architecture based on attention
▶ No recurrence
▶ Constant in source and target “time” while training
▶ In inference, only constant in source “time”
▶ Better parallelizable than RNN-based networks
Transformer Illustration
Figure: Two-Layer Transformer (image from Alammar 2018)
Ablation Concerns
Figure: Areas of Concern for this Project
Feed Forward Layers
▶ Contribution of feed forward layers in encoder and decoder not clear
▶ Is the attention enough?
→ Three configurations:
▶ No encoder FF layer
▶ No decoder FF layer
▶ No encoder and no decoder FF layer
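To make the three configurations concrete, here is a minimal PyTorch-style sketch (illustrative only, not the project's actual training code) of an encoder layer whose feed forward sublayer can be switched off; the class name and the `use_ff` flag are hypothetical.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer with an optional feed forward sublayer.

    Sketch of the ablation: use_ff=False drops the FF sublayer entirely,
    leaving only self-attention with residual connection and layer norm.
    """
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ff=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.use_ff = use_ff
        if use_ff:
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual + layer norm
        if self.use_ff:
            x = self.norm2(x + self.ff(x))  # residual + layer norm
        return x
```

The decoder-side variant would drop the FF sublayer from the decoder layers in the same way.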
Ablation Concerns
Figure: Areas of Concern for this Project
Ablation Concerns
Figure: Areas of Concern for this Project
Positional Embeddings
▶ No recurrence → no information about order of tokens
→ Add information via explicit positional embeddings
▶ Added to the word embedding vector
▶ Two types:
▶ Learned embeddings of absolute position (e.g. Gehring et al. 2017)
▶ Sinusoidal embeddings (used in Vaswani et al. 2017):
PE(pos, 2i)   = sin(pos / 10000^(2i / d_key))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_key))
(d_key = key dimensionality)
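The sinusoidal embeddings from the two formulas above can be computed directly; the following NumPy sketch assumes an even key dimensionality and is meant only as an illustration.

```python
import numpy as np

def sinusoidal_embeddings(n_positions, d_key):
    """Sinusoidal positional embeddings as in Vaswani et al. (2017).

    PE(pos, 2i)   = sin(pos / 10000**(2i / d_key))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_key))
    Assumes d_key is even.
    """
    positions = np.arange(n_positions)[:, None]          # (n_positions, 1)
    dims = np.arange(0, d_key, 2)[None, :]               # (1, d_key / 2), values 2i
    angles = positions / np.power(10000.0, dims / d_key)
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe  # added to the word embeddings before the first layer
```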
Positional Embeddings Illustrated
Figure: Illustrated positional embeddings for a 20-token sentence and key dimensionality 512 (taken from Alammar 2018)
Modifications
▶ Vary “rainbow stretch” by introducing stretching factor α:
PE(pos, 2i) = sin(pos / (α · 10000)^(2i / d_key))
▶ Expectations:
▶ α too low: no positional information
▶ α too high: word embedding information destroyed
▶ α other than 1 optimal
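A sketch of the stretched variant, in the style of the previous snippet; placing α as a factor on the base 10000 is my reading of the formula and should be treated as an assumption, with α = 1 recovering the original embeddings.

```python
import numpy as np

def stretched_embeddings(n_positions, d_key, alpha=1.0):
    """Sinusoidal embeddings with a stretching factor alpha (assumed placement:
    PE(pos, 2i) = sin(pos / (alpha * 10000)**(2i / d_key)); alpha=1.0 is the original)."""
    positions = np.arange(n_positions)[:, None]
    dims = np.arange(0, d_key, 2)[None, :]
    angles = positions / np.power(alpha * 10000.0, dims / d_key)
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```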
Self-Attention Keys and Queries
▶ Attention(Q, K, V) = softmax((Q W^Q)(K W^K)^T / √d_k) (V W^V)
▶ In encoder, source words are used for key and query generation with different matrices
▶ Modification: use the same matrix for both
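The modification can be illustrated with a single-head sketch where the key and query projections are literally the same linear module; this is an illustrative assumption about how the tying would be implemented, not the project's code.

```python
import torch.nn as nn
import torch.nn.functional as F

class TiedKQAttention(nn.Module):
    """Single-head self-attention where keys and queries share one projection.

    Sketch of the modification: W_Q and W_K are the same matrix (self.kq),
    while values keep their own projection W_V.
    """
    def __init__(self, d_model=512, d_k=64):
        super().__init__()
        self.kq = nn.Linear(d_model, d_k, bias=False)  # shared W_Q = W_K
        self.v = nn.Linear(d_model, d_k, bias=False)   # separate W_V
        self.d_k = d_k

    def forward(self, x):                              # x: (batch, len, d_model)
        q = self.kq(x)
        k = self.kq(x)                                 # same projection as q
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        return F.softmax(scores, dim=-1) @ self.v(x)
```

With tied projections the score matrix becomes symmetric, which is exactly the capacity the ablation probes.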
Experiment Design
▶ Do all basic configurations
▶ Combine well-performing modifications
▶ How to compare?
▶ BLEU score on test set at best dev set performance
▶ Whole learning curves (similar to Popel and Bojar 2018)
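The first criterion ("test BLEU at best dev performance") could be computed along these lines; the helper below uses sacrebleu and treats the checkpoint list and the `translate` decoding function as placeholders, not as the project's actual tooling.

```python
import sacrebleu

def test_bleu_at_best_dev(checkpoints, dev_refs, test_refs, translate):
    """Pick the checkpoint with the best dev BLEU and report its test BLEU.

    checkpoints: iterable of model checkpoints (placeholder)
    translate(checkpoint, split): returns a list of hypothesis strings (placeholder)
    """
    def bleu(hyps, refs):
        return sacrebleu.corpus_bleu(hyps, [refs]).score

    best = max(checkpoints, key=lambda c: bleu(translate(c, "dev"), dev_refs))
    return bleu(translate(best, "test"), test_refs)
```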
Dataset
▶ Parallel image captions (Elliott et al. 2016)
▶ https://github.com/multi30k/dataset
▶ Short sentences
▶ Rather small (30k sentences)
▶ Good because fitting takes less than a day
▶ Bad because dev and test performance is far below train performance
Conclusion
▶ Experiments still pending
▶ Expecting to see mainly negative results
▶ Hopefully some positive ones
Nice translation by the system
“eine junge frau hält eine blume , um sich an der blume zu halten .”
“a young woman is holding a flower in order to hold on to the flower .”
References I
Alammar, Jay (2018). The Illustrated Transformer. URL: http://jalammar.github.io/illustrated-transformer/.
Dai, Zihang et al. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”. In: CoRR abs/1901.02860. arXiv: 1901.02860. URL: http://arxiv.org/abs/1901.02860.
Elliott, Desmond et al. (2016). “Multi30K: Multilingual English-German Image Descriptions”. In: Proceedings of the 5th Workshop on Vision and Language. Berlin, Germany: Association for Computational Linguistics, pp. 70–74. DOI: 10.18653/v1/W16-3210. URL: http://www.aclweb.org/anthology/W16-3210.
Gehring, Jonas et al. (2017). “Convolutional Sequence to Sequence Learning”. In: CoRR abs/1705.03122. arXiv: 1705.03122. URL: http://arxiv.org/abs/1705.03122.
Ott, Myle et al. (2018). “Scaling Neural Machine Translation”. In: CoRR abs/1806.00187. arXiv: 1806.00187. URL: http://arxiv.org/abs/1806.00187.
References II
Popel, Martin and Ondřej Bojar (2018). “Training Tips for the Transformer Model”. In: CoRR abs/1804.00247. arXiv: 1804.00247. URL: http://arxiv.org/abs/1804.00247.
Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: CoRR abs/1706.03762. arXiv: 1706.03762. URL: http://arxiv.org/abs/1706.03762.