Transformer Networks
Amir Ali Moinfar - M. Soleymani
Deep Learning, Sharif University of Technology, Spring 2019
The “simple” translation model
- Embedding each word (word2vec, trainable, …)
- Some tricks: teacher forcing
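A minimal sketch of such an encoder-decoder model with teacher forcing, in PyTorch (module names, sizes, and the GRU choice are illustrative assumptions, not from the slides):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Simple encoder-decoder translation model: the source sentence is
    compressed into the encoder's final hidden state (illustrative sketch)."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)  # trainable word embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode: only the final hidden state is handed to the decoder.
        _, h = self.encoder(self.src_emb(src))
        # Teacher forcing: the decoder is fed the gold previous tokens
        # (tgt_in), not its own predictions, during training.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)  # logits over the target vocabulary
```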
Problems with this framework
- All the information about the input is embedded into a single vector
  - Particularly problematic if the input is long
- No parallelization: tokens must be processed one at a time
- Problems in backpropagating through long sequences (vanishing/exploding gradients)
Parallelization: Convolutional Models
- Some work: Extended Neural GPU, ByteNet, ConvS2S
- Limited by the size of the convolution kernel: the receptive field grows only layer by layer
- Maximum path length between two positions: O(n/k) for plain convolutions, O(log_k n) for dilated ones (ByteNet)
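A toy illustration of this trade-off (kernel size and widths are made up): a convolution computes every position in parallel, but each layer only sees k neighbors, so information between distant tokens must travel through many layers.

```python
import torch
import torch.nn as nn

k, d = 3, 256                       # illustrative kernel size and width
conv = nn.Conv1d(d, d, kernel_size=k, padding=k // 2)

x = torch.randn(8, d, 100)          # (batch, channels, sequence length)
y = conv(x)                         # all 100 positions computed in parallel
# After L such layers the receptive field is only 1 + L*(k-1), so two
# tokens n positions apart need on the order of n/k layers to interact.
```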
Removing bottleneck: Attention Mechanism
- Compute a weighted sum of the encoder hidden states
- Weights are functions of the current decoder state and each encoder state
- The weights form a probability distribution over source positions (via a softmax)
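A sketch of this weighted sum with dot-product scoring (one simple choice of scoring function; additive scoring as in Bahdanau et al. is another):

```python
import torch
import torch.nn.functional as F

def attend(dec_state, enc_states):
    """dec_state: (batch, d); enc_states: (batch, n, d).
    Returns a context vector: a weighted sum of the encoder states."""
    scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)  # (batch, n)
    weights = F.softmax(scores, dim=1)  # a distribution over source positions
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (batch, d)
    return context, weights
```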
Attention Effect in machine translation
Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", 2014
- Left: plain RNNs degrade on long sentences
- Right: Attention map in machine translation
RNNs with Attention for VQA
- Each hidden output of the LSTM selects a region of the image to attend to
Attention Mechanism - Abstract View
- A lookup mechanism: a query is compared against keys, and the resulting weights combine the corresponding values
Attention Mechanism - Abstract View (cont.)
- For large values of d_k, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; hence the scores are scaled by 1/√d_k
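In code, the scaled dot-product attention of Vaswani et al. (2017), Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, might look like this sketch:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (..., n, d_k). Computes softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., n, n)
    if mask is not None:
        # Masked positions get -inf so their softmax weight is zero.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V
```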
Self Attention
- AKA intra-attention
- An attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence
- Check the case where queries, keys, and values all come from the same sequence
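Self-attention is then just the scaled dot-product attention sketched above, applied with queries, keys, and values all derived from the same sequence X (the projection matrices here are untrained placeholders):

```python
import torch

d_model, n = 512, 10
Wq = torch.randn(d_model, d_model)  # placeholder projections, normally learned
Wk = torch.randn(d_model, d_model)
Wv = torch.randn(d_model, d_model)

X = torch.randn(1, n, d_model)      # one sequence of n token representations
Y = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
# Every position of Y mixes information from every position of X.
```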
Multi-head attention
- Allows the model to jointly attend to information from different representation subspaces at different positions
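A compact sketch of multi-head attention (h heads, each of width d_model/h), building on the scaled_dot_product_attention function above; the hyperparameters follow the paper's base model, but the class itself is illustrative:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """h parallel scaled dot-product attentions over learned projections,
    concatenated and projected back to d_model."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.Wq, self.Wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.Wv, self.Wo = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # Reshape (B, n, d_model) -> (B, h, n, d_k): one slice per head.
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(
            split(self.Wq(q)), split(self.Wk(k)), split(self.Wv(v)), mask)
        # Concatenate heads back to (B, n, d_model) and mix them.
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.Wo(out)
```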
Multi-head Self Attention
Jay Alammar, "The Illustrated Transformer", http://jalammar.github.io/illustrated-transformer/

Bonus: Attention Is All She Needs
Gregory Jantz, "Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?", HuffPost, 2014, https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272

Attention Is All You Need
- Replace LSTMs with a lot of attention!
- Lower complexity per layer
- Can be parallelized, so it is faster
- Easier to learn distant dependencies
Transformer’s Behavior
- Encoding + first decoding step (animations from Vaswani et al., "Attention Is All You Need", 2017, and Jay Alammar, "The Illustrated Transformer")
Transformer’s Behavior (cont.)
- Decoding (animations from Vaswani et al., 2017, and Jay Alammar, "The Illustrated Transformer")
Transformer architecture
- The core of it: an encoder stack and a decoder stack
Transformer architecture (cont.)
- Encoder
- Decoder
- Output
Transformer architecture (cont.)
- Output
Transformer architecture (cont.)
- Encoder and Decoder
Transformer architecture (cont.)
- Feed-forward Layers
- Residual links
- Layer-norm (the Transformer uses layer normalization, not batch normalization)
- Dropout
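Putting these pieces together, one encoder layer might look like the following sketch (reusing the MultiHeadAttention class above; the post-norm ordering follows the original paper):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention and a position-wise
    feed-forward net, each wrapped in residual + dropout + layer-norm."""
    def __init__(self, d_model=512, h=8, d_ff=2048, p=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, h)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))  # residual link
        return self.norm2(x + self.drop(self.ffn(x)))            # residual link
```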
Transformer architecture (cont.)
- Attention is all it needs
Transformer architecture (cont.)
- [Multi-head] attention is all it needs
Transformer architecture (cont.)
- Two types of attention are all it needs :D
  - Self-attention (masked in the decoder) and encoder-decoder attention, where queries come from the decoder and keys/values from the encoder
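The decoder's self-attention is masked so a position cannot look at later positions; a sketch with the attention function above (sizes illustrative):

```python
import torch

n, d_model = 10, 512
x = torch.randn(1, n, d_model)
# Lower-triangular mask: position i may only attend to positions <= i,
# so the decoder cannot peek at future tokens during training.
causal_mask = torch.tril(torch.ones(n, n))
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
```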
Transformer architecture (cont.)
- Embeddings
Transformer architecture (cont.)
- Positional Encoding
- It allows the model to easily learn to attend by relative positions: for any fixed offset k, PE(pos+k) is a linear function of PE(pos)
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
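A direct implementation of these formulas (assuming an even d_model):

```python
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)           # even dims
    angles = pos / (10000 ** (i / d_model))                      # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even indices: sine
    pe[:, 1::2] = torch.cos(angles)   # odd indices: cosine
    return pe                         # added to the token embeddings
```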
Transformer architecture (cont.)
- A two-layer Transformer network (figure from Vaswani et al., "Attention Is All You Need", 2017)
Transformer’s Behavior
- Encoding + first decoding step (animations from Vaswani et al., 2017, and Jay Alammar, "The Illustrated Transformer")
Transformer’s Behavior (cont.)
- Decoding (animations from Vaswani et al., 2017, and Jay Alammar, "The Illustrated Transformer")
Complexity
- Advantages (Vaswani et al., 2017, Table 1):
  - Self-attention: O(n²·d) per layer, O(1) sequential operations, O(1) maximum path length
  - Recurrent: O(n·d²) per layer, O(n) sequential operations, O(n) maximum path length
  - Convolutional: O(k·n·d²) per layer, O(1) sequential operations, O(log_k n) maximum path length
  - Self-attention is cheaper per layer when the sequence length n is smaller than the representation dimension d
Interpretability
- The attention mechanism following long-distance dependencies in the encoder self-attention, layer 5 of 6 (Vaswani et al., 2017)
Interpretability (cont.)
- Two different heads in the encoder self-attention, layer 5 of 6, learn different relations (Vaswani et al., 2017)
References
- Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017.
- Alammar, Jay. "The Illustrated Transformer." http://jalammar.github.io/illustrated-transformer/
- Zhang, Shiyue. "Attention Is All You Need - Ppt Download." SlidePlayer, 20 June 2017.
- Kurbanov, Rauf. "Attention Is All You Need." JetBrains Research, 27 Jan. 2019.
- Polosukhin, Illia. "Attention Is All You Need." LinkedIn SlideShare, 26 Sept. 2017.
- Rush, Alexander. "The Annotated Transformer." 3 Apr. 2018.
- Uszkoreit, Jakob. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, 2017.
Q&A
Thanks for your attention!

Your Attention = Softmax_You(Presentation | Anything else)