SLIDE 1

Transformer Networks

Amir Ali Moinfar, M. Soleymani
Deep Learning, Sharif University of Technology, Spring 2019
SLIDE 2

The “simple” translation model

  • Embedding each word (word2vec, trainable, …)
  • Some tricks:
    – Teacher forcing (sketched below)
    – Reversing the input
This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019.
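Teacher forcing means feeding the ground-truth previous token into the decoder at each training step, rather than the model's own (possibly wrong) prediction. A minimal sketch, assuming a generic decoder_step callable and a start-token id of 0 (both hypothetical names, not from the slides):

```python
import numpy as np

def teacher_forced_loss(decoder_step, target_tokens, state):
    """Average cross-entropy over a target sequence, with teacher forcing.
    decoder_step(prev_token, state) -> (probs, new_state) is any step
    function, e.g. one step of an RNN decoder."""
    loss, prev = 0.0, 0          # 0 = assumed <sos> start token
    for gold in target_tokens:
        probs, state = decoder_step(prev, state)
        loss -= np.log(probs[gold] + 1e-12)  # negative log-likelihood of the true token
        prev = gold   # teacher forcing: feed the TRUE token, not argmax(probs)
    return loss / len(target_tokens)
```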
SLIDE 3

Problems with this framework

  • All the information about the input is embedded into a single vector
    – The last hidden node is “overloaded” with information, particularly if the input is long
  • Parallelization?
  • Problems in backpropagation through the sequence
This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019.
SLIDE 4


Parallelization: Convolutional Models

  • Some work:
    – Neural GPU
    – ByteNet
    – ConvS2S
  • Limited by the size of the convolution kernel
  • Maximum path length: O(log_k n) (worked example below)
Kalchbrenner et al., “Neural Machine Translation in Linear Time”, 2017.
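A quick worked example of that bound (the numbers are mine, for illustration): with kernel size k = 4, information must pass through log_4(n) stacked dilated convolution layers to travel between positions n apart, whereas self-attention connects any two positions in one step.

```latex
\text{dilated conv.: } O(\log_k n), \quad
k = 4,\; n = 256 \;\Rightarrow\; \log_4 256 = 4 \text{ layers}
\qquad \text{self-attention: } O(1)
```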
SLIDE 5

Removing bottleneck: Attention Mechanism

  • Compute a weighted combination of all the hidden outputs into a single vector
  • Weights are a function of the current output state
  • The weights are a distribution over the input (they sum to 1)
This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019.
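A minimal numpy sketch of that weighted combination (function and variable names are mine; dot-product scoring is just one common choice of weight function):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (n, d).
    Returns the context vector and the attention distribution."""
    scores = encoder_states @ decoder_state   # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: a distribution, sums to 1
    return weights @ encoder_states, weights  # weighted combination, shape (d,)
```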
SLIDE 6

Attention Effect in machine translation

  • Left: normal RNNs and long sentences
  • Right: attention map in machine translation
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, 2014.
SLIDE 7


RNNs with Attention for VQA

  • Each hidden output of the LSTM selects a part of the image to look at
Zhu et al., “Visual7W: Grounded Question Answering in Images”, 2016.
SLIDE 8

Attention Mechanism - Abstract View

  • A lookup mechanism:
    – Query
    – Key
    – Value
SLIDE 9

Attention Mechanism - Abstract View (cont.)

[Figure: attention as a query/key/value lookup]

SLIDE 10

Attention Mechanism - Abstract View (cont.)

  • For large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
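The paper's remedy is to scale the dot products by 1/√d_k before the softmax. A self-contained numpy sketch of scaled dot-product attention, softmax(Q·Kᵀ/√d_k)·V:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # 1/sqrt(d_k) keeps softmax out of its flat regions
    return softmax(scores) @ V
```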
SLIDE 11

Self Attention

  • AKA intra-attention
  • An attention mechanism relating different positions of a single sequence ⇒ Q, K, V are all derived from the same sequence
  • Check the case when
    – Q_i = W^Q X_i
    – K_1, …, K_n = W^K X_1, …, W^K X_n
    – V_1, …, V_n = W^V X_1, …, W^V X_n
Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
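In code, the three projections above are just three learned matrices applied to the same sequence. A toy sketch (the sizes and random weights are placeholders for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8          # toy sizes, not the paper's
X = rng.normal(size=(n, d_model))   # a single input sequence

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # Q, K, V all come from the SAME sequence
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V   # (n, d_k): every position attends to every position
```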
SLIDE 12

Multi-head attention

  • Allows the model to
    – jointly attend to information
    – from different representation subspaces
    – at different positions
Vaswani et al., “Attention Is All You Need”, 2017 [modified]. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
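A compact sketch of multi-head attention: h independent attentions over lower-dimensional projections, concatenated and projected back. The shapes follow the paper (d_k = d_model / h); the random weights are stand-ins for learned ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h*d_k, d_model)."""
    heads = []
    for i in range(len(W_Q)):  # each head gets its own representation subspace
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_O  # concatenate heads, project back

# toy usage
rng = np.random.default_rng(0)
h, d_model = 8, 64
d_k = d_model // h
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(rng.normal(size=(10, d_model)), W_Q, W_K, W_V, W_O)  # (10, 64)
```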
SLIDE 13

Multi-head Self Attention

Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 14

Bonus: Attention Is All She Needs

Gregory Jantz, “Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?”, https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272, 2014.
SLIDE 15

Attention Is All You Need

  • Replace LSTMs with a lot of attention!
    – State-of-the-art results
    – Much less computation for training
  • Advantages:
    – Less complex
    – Can be parallelized, so faster
    – Easier to learn distant dependencies
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 16

Transformer’s Behavior

  • Encoding + first decoding step [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 17

Transformer’s Behavior (cont.)

  • Decoding [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 18

Transformer architecture

  • The core of it:
    – Multi-head attention
    – Positional encoding
[Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jakob Uszkoreit, “Transformer: A Novel Neural Network Architecture for Language Understanding”, https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
SLIDE 19

Transformer architecture (cont.)

  • Encoder (one layer is sketched below)
    – Input embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Feed-forward layers with residual links
  • Decoder
    – Output embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Multi-head encoder-decoder attention
    – Feed-forward layers with residual links
  • Output
    – Linear + Softmax
Vaswani et al., “Attention Is All You Need”, 2017.
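To make the ordering concrete, here is a minimal structural sketch of one encoder layer in the paper's post-norm arrangement (single-head attention stands in for multi-head, and dropout is omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(X):                  # single-head stand-in for multi-head
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ X

def feed_forward(X, W1, b1, W2, b2):    # position-wise FFN, applied at every token
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def encoder_layer(X, W1, b1, W2, b2):
    X = layer_norm(X + self_attention(X))   # sublayer 1: attention + residual + norm
    return layer_norm(X + feed_forward(X, W1, b1, W2, b2))  # sublayer 2: FFN + residual + norm

# toy usage with random weights
rng = np.random.default_rng(0)
d, d_ff = 16, 64
out = encoder_layer(rng.normal(size=(6, d)),
                    rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                    rng.normal(size=(d_ff, d)), np.zeros(d))   # (6, 16)
```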
SLIDE 20

Transformer architecture (cont.)

  • Output
    – Linear + Softmax
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 21

Transformer architecture (cont.)

  • Encoder and Decoder
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 22

Transformer architecture (cont.)

  • Feed-forward layers
  • Residual links
  • Layer normalization (the paper uses layer norm, not batch norm)
  • Dropout
Vaswani et al., “Attention Is All You Need”, 2017.
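These pieces combine into one wrapper that the paper applies around every sublayer: LayerNorm(x + Dropout(Sublayer(x))). A sketch (training-time inverted dropout; the 0.1 rate matches the paper's base model):

```python
import numpy as np

def dropout(x, rate, rng):
    mask = rng.random(x.shape) >= rate  # randomly zero activations during training
    return x * mask / (1.0 - rate)      # inverted dropout: rescale so expectations match

def sublayer_connection(x, sublayer, rng, rate=0.1, eps=1e-6):
    """LayerNorm(x + Dropout(sublayer(x))): the wrapper used around both the
    attention and feed-forward sublayers."""
    y = x + dropout(sublayer(x), rate, rng)   # residual link
    return (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + eps)

# toy usage: wrap an identity "sublayer"
rng = np.random.default_rng(0)
out = sublayer_connection(rng.normal(size=(4, 8)), lambda z: z, rng)
```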
SLIDE 23

Transformer architecture (cont.)

  • Attention is all it needs
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 24

Transformer architecture (cont.)

  • [Multi-head] attention is all it needs
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 25

Transformer architecture (cont.)

  • Two types of attention is all it needs :D
  • Remember the signature of multi-head attention
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 26

Transformer architecture (cont.)

  • Embeddings
    – Just a lookup table (sketched below)
Vaswani et al., “Attention Is All You Need”, 2017.
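“Just a lookup table” is meant literally: embedding a token is indexing a row of a learned matrix. A toy sketch (the vocabulary size, dimensions, and token ids are made up):

```python
import numpy as np

vocab_size, d_model = 1000, 512
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))  # learned in practice

token_ids = np.array([5, 42, 7])   # ids produced by some tokenizer (made up here)
embedded = E[token_ids]            # (3, 512): a plain row lookup, no matmul
embedded *= np.sqrt(d_model)       # the paper scales embeddings by sqrt(d_model)
```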
SLIDE 27

Transformer architecture (cont.)

  • Positional Encoding
  • It would allow the model to easily learn to attend by relative positions, since for any fixed offset k:
    sin(pos + k) = sin(pos)·cos(k) + cos(pos)·sin(k)
    cos(pos + k) = cos(pos)·cos(k) − sin(pos)·sin(k)
Vaswani et al., “Attention Is All You Need”, 2017. Alexander Rush, “The Annotated Transformer”, http://nlp.seas.harvard.edu/2018/04/03/attention.html (5/20/2019).
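A sketch of the paper's sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(...); by the angle-addition identities above, PE(pos + k) is a linear function of PE(pos):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims get sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims get cosine
    return pe

pe = positional_encoding(50, 512)  # one row per position, added to the embeddings
```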
SLIDE 28

Transformer architecture (cont.)

  • A 2-tier Transformer network
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 29

Transformer’s Behavior

  • Encoding + first decoding step [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 30

Transformer’s Behavior (cont.)

  • Decoding [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 31

Complexity

  • Advantages:
    – Less complex
    – Can be parallelized, so faster
    – Easier to learn distant dependencies
Vaswani et al., “Attention Is All You Need”, 2017.
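For reference, these claims trace back to the per-layer comparison in Table 1 of Vaswani et al. (2017), with n = sequence length, d = representation dimension, k = kernel size:

```
Layer type        Complexity per layer   Sequential ops   Max path length
Self-attention    O(n^2 · d)             O(1)             O(1)
Recurrent         O(n · d^2)             O(n)             O(n)
Convolutional     O(k · n · d^2)         O(1)             O(log_k n)
```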
SLIDE 32

Interpretability

  • Attention mechanism in the encoder self-attention, layer 5 of 6
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 33

Interpretability (cont.)

  • Two heads in the encoder self-attention, layer 5 of 6
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 34

References

  • Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017.
  • Alammar, Jay. “The Illustrated Transformer.” 27 June 2018, jalammar.github.io/illustrated-transformer/.
  • Zhang, Shiyue. “Attention Is All You Need.” SlidePlayer, 20 June 2017, slideplayer.com/slide/13789541/.
  • Kurbanov, Rauf. “Attention Is All You Need.” JetBrains Research, 27 Jan. 2019, research.jetbrains.org/files/material/5ace635c03259.pdf.
  • Polosukhin, Illia. “Attention Is All You Need.” LinkedIn SlideShare, 26 Sept. 2017, www.slideshare.net/ilblackdragon/attention-is-all-you-need.
  • Rush, Alexander. “The Annotated Transformer.” 3 Apr. 2018, nlp.seas.harvard.edu/2018/04/03/attention.html.
  • Uszkoreit, Jakob. “Transformer: A Novel Neural Network Architecture for Language Understanding.” Google AI Blog, 31 Aug. 2017, ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
SLIDE 35

Q&A

SLIDE 36

Thanks for your attention!

Your Attention = Softmax(You · [Presentation | Anything else]ᵀ) · [Presentation | Anything else]