1. Transformer Ablation Studies
Simon Will
Institute of Formal and Applied Linguistics, Charles University
Seminar: Statistical Machine Translation
Instructor: Dr. Ondřej Bojar
30 May 2019

2. Structure
▶ The Transformer
▶ Ablation Concerns
▶ Feed Forward Layers
▶ Positional Embeddings
▶ Self-Attention Keys and Queries

3. Idea
▶ Transformer successful and many variations exist (e.g. Ott et al. 2018; Dai et al. 2019)
▶ Difficult to know what the essentials are and what each part contributes
→ Train similar models differing in crucial points

4. The Transformer (Vaswani et al. 2017)
▶ Encoder-decoder architecture based on attention
▶ No recurrence
▶ Constant in source and target “time” while training
▶ In inference, only constant in source “time”
▶ Better parallelizable than RNN-based networks

5. Transformer Illustration
Figure: Two-Layer Transformer (image from Alammar 2018)

6. Ablation Concerns
Figure: Areas of Concern for this Project

7. Feed Forward Layers
▶ Contribution of feed forward layers in encoder and decoder not clear
▶ Is the attention enough?
→ Three configurations:
  ▶ No encoder FF layer
  ▶ No decoder FF layer
  ▶ No decoder and no encoder FF layer
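The three ablation configurations above can be enumerated as a small config object; all names here (`AblationConfig`, the flag and config names) are hypothetical illustrations, not identifiers from the actual project code:

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Which feed-forward sublayers to keep in the Transformer."""
    encoder_ff: bool = True
    decoder_ff: bool = True

# Baseline plus the three ablated configurations from the slide.
CONFIGS = {
    "baseline":      AblationConfig(),
    "no_encoder_ff": AblationConfig(encoder_ff=False),
    "no_decoder_ff": AblationConfig(decoder_ff=False),
    "no_ff":         AblationConfig(encoder_ff=False, decoder_ff=False),
}

for name, cfg in CONFIGS.items():
    print(name, cfg)
```

A driver script could then loop over `CONFIGS` and train one model per entry, keeping everything else fixed.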

8. Ablation Concerns
Figure: Areas of Concern for this Project

9. Ablation Concerns
Figure: Areas of Concern for this Project

10. Positional Embeddings
▶ No recurrence → no information about order of tokens
→ Add information via explicit positional embeddings
▶ Added to the word embedding vector
▶ Two types:
  ▶ Learned embeddings of absolute position (e.g. Gehring et al. 2017)
  ▶ Sinusoidal embeddings (used in Vaswani et al. 2017):
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_key))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_key))
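The sinusoidal embeddings above are straightforward to compute; a minimal NumPy sketch (function name chosen here for illustration):

```python
import numpy as np

def sinusoidal_positional_embeddings(n_positions: int, d_key: int) -> np.ndarray:
    """Sinusoidal positional embeddings as in Vaswani et al. (2017).

    Even dimensions get sin(pos / 10000^(2i/d_key)),
    odd dimensions get cos(pos / 10000^(2i/d_key)).
    """
    pe = np.zeros((n_positions, d_key))
    positions = np.arange(n_positions)[:, None]        # shape (n_positions, 1)
    div = 10000 ** (np.arange(0, d_key, 2) / d_key)    # one wavelength per sin/cos pair
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# The setting from the illustration slide: 20 tokens, key dimensionality 512.
pe = sinusoidal_positional_embeddings(20, 512)
print(pe.shape)  # (20, 512)
```

These vectors are then added element-wise to the word embeddings before the first encoder layer.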

11. Positional Embeddings Illustrated
Figure: Illustrated positional embeddings for a 20-token sentence and key dimensionality 512 (taken from Alammar 2018)

12. Modifications
▶ Vary the “rainbow stretch” by introducing a stretching factor α:
    PE(pos, 2i) = sin(pos / (α · 10000)^(2i / d_key))
▶ Expectations:
  ▶ α too low: no positional information
  ▶ α too high: word embedding information destroyed
  ▶ α other than 1 optimal
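Under the reading above, where α rescales the wavelength base 10000 (the exact placement of α is an assumption recovered from the slide), the modification is a one-line change to the sinusoidal embedding computation:

```python
import numpy as np

def stretched_positional_embeddings(n_positions: int, d_key: int,
                                    alpha: float = 1.0) -> np.ndarray:
    """Sinusoidal embeddings with a stretching factor alpha on the base.

    alpha = 1.0 recovers the standard Vaswani et al. (2017) embeddings;
    other values stretch or compress the range of wavelengths.
    """
    pe = np.zeros((n_positions, d_key))
    positions = np.arange(n_positions)[:, None]
    div = (alpha * 10000) ** (np.arange(0, d_key, 2) / d_key)
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe
```

Sweeping `alpha` over, say, a log-spaced grid and comparing BLEU curves would test the expectations listed above.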

13. Self-Attention Keys and Queries
▶ Attention(Q, K, V) = softmax((Q W^Q)(K W^K)^T / √d_k) (V W^V)
▶ In encoder, source words are used for key and query generation with different matrices
▶ Modification: use the same matrix for both
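The formula and the proposed weight-sharing ablation can be sketched in a few lines of NumPy; the `share_qk` flag is a name invented here for illustration, not from the project code:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, W_Q: np.ndarray, W_K: np.ndarray,
                   W_V: np.ndarray, share_qk: bool = False) -> np.ndarray:
    """Scaled dot-product self-attention over token representations X.

    With share_qk=True, W_Q is used for the keys as well (the ablation
    from the slide: one matrix for both queries and keys).
    """
    if share_qk:
        W_K = W_Q
    d_k = W_K.shape[1]
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)  # (n_tokens, n_tokens)
    return softmax(scores) @ (X @ W_V)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, model dim 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V, share_qk=True)
print(out.shape)  # (5, 8)
```

Sharing the matrix halves the query/key parameters and makes the score matrix symmetric up to the scaling, which is exactly what the ablation probes.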

14. Experiment Design
▶ Do all basic configurations
▶ Combine well-performing modifications
▶ How to compare?
  ▶ BLEU score on test set at best dev set performance
  ▶ Whole learning curves (similar to Popel and Bojar 2018)

15. Dataset
▶ Parallel image captions (Elliott et al. 2016)
▶ https://github.com/multi30k/dataset
▶ Short sentences
▶ Rather small (30k sentences)
  ▶ Good because fitting takes less than a day
  ▶ Bad because dev and test performance is far below train performance

16. Conclusion
▶ Experiments still pending
▶ Expecting to see mainly negative results
▶ Hopefully some positive ones

17. Nice translation by the system
“eine junge frau hält eine blume , um sich an der blume zu halten .”
“a young woman is holding a flower in order to hold on to the flower .”

18. References I
Alammar, Jay (2018). The Illustrated Transformer. URL: http://jalammar.github.io/illustrated-transformer/.
Dai, Zihang et al. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”. In: CoRR abs/1901.02860. arXiv: 1901.02860. URL: http://arxiv.org/abs/1901.02860.
Elliott, Desmond et al. (2016). “Multi30K: Multilingual English-German Image Descriptions”. In: Proceedings of the 5th Workshop on Vision and Language. Berlin, Germany: Association for Computational Linguistics, pp. 70–74. DOI: 10.18653/v1/W16-3210. URL: http://www.aclweb.org/anthology/W16-3210.
Gehring, Jonas et al. (2017). “Convolutional Sequence to Sequence Learning”. In: CoRR abs/1705.03122. arXiv: 1705.03122. URL: http://arxiv.org/abs/1705.03122.
Ott, Myle et al. (2018). “Scaling Neural Machine Translation”. In: CoRR abs/1806.00187. arXiv: 1806.00187. URL: http://arxiv.org/abs/1806.00187.

19. References II
Popel, Martin and Ondřej Bojar (2018). “Training Tips for the Transformer Model”. In: CoRR abs/1804.00247. arXiv: 1804.00247. URL: http://arxiv.org/abs/1804.00247.
Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: CoRR abs/1706.03762. arXiv: 1706.03762. URL: http://arxiv.org/abs/1706.03762.
