SLIDE 1

Transformer Networks

Amir Ali Moinfar, M. Soleymani
Deep Learning, Sharif University of Technology, Spring 2019
SLIDE 2

The “simple” translation model

  • Embedding each word (word2vec, trainable, …)
  • Some tricks:
    – Teacher forcing (sketched below)
    – Reversing the input
This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019.
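Teacher forcing means feeding the ground-truth previous token into the decoder at each training step, rather than the model's own (possibly wrong) prediction. A minimal sketch, assuming a generic decoder_step callable and a start-token id of 0 (both hypothetical names, not from the slides):

```python
import numpy as np

def teacher_forced_loss(decoder_step, target_tokens, state):
    """Average cross-entropy over a target sequence, with teacher forcing.
    decoder_step(prev_token, state) -> (probs, new_state) is any step
    function, e.g. one step of an RNN decoder."""
    loss, prev = 0.0, 0          # 0 = assumed <sos> start token
    for gold in target_tokens:
        probs, state = decoder_step(prev, state)
        loss -= np.log(probs[gold] + 1e-12)  # negative log-likelihood of the true token
        prev = gold   # teacher forcing: feed the TRUE token, not argmax(probs)
    return loss / len(target_tokens)
```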
SLIDE 3

Problems with this framework

  • All the information about the input is embedded into a single vector
    – The last hidden node is “overloaded” with information, particularly if the input is long
  • Parallelization?
  • Problems in backpropagation through the sequence
This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019.
SLIDE 4


Parallelization: Convolutional Models

  • Some work:
    – Neural GPU
    – ByteNet
    – ConvS2S
  • Limited by the size of the convolution kernel
  • Maximum path length: O(log_k n) (worked example below)
Kalchbrenner et al., “Neural Machine Translation in Linear Time”, 2017.
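A quick worked example of that bound (the numbers are mine, for illustration): with kernel size k = 4, information must pass through log_4(n) stacked dilated convolution layers to travel between positions n apart, whereas self-attention connects any two positions in one step.

```latex
\text{dilated conv.: } O(\log_k n), \quad
k = 4,\; n = 256 \;\Rightarrow\; \log_4 256 = 4 \text{ layers}
\qquad \text{self-attention: } O(1)
```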
SLIDE 5

Removing bottleneck: Attention Mechanism

  • Compute a weighted combination of all the hidden outputs into a single vector
  • Weights are a function of the current output state
  • The weights are a distribution over the input (they sum to 1)
This slide has been adapted from Bhiksha Raj, 11-785, CMU 2019.
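A minimal numpy sketch of that weighted combination (function and variable names are mine; dot-product scoring is just one common choice of weight function):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (n, d).
    Returns the context vector and the attention distribution."""
    scores = encoder_states @ decoder_state   # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: a distribution, sums to 1
    return weights @ encoder_states, weights  # weighted combination, shape (d,)
```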
SLIDE 6

Attention Effect in machine translation

  • Left: normal RNNs and long sentences
  • Right: attention map in machine translation
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, 2014.
SLIDE 7


RNNs with Attention for VQA

  • Each hidden output of the LSTM selects a part of the image to look at
Zhu et al., “Visual7W: Grounded Question Answering in Images”, 2016.
SLIDE 8

Attention Mechanism - Abstract View

  • A lookup mechanism:
    – Query
    – Key
    – Value
SLIDE 9

Attention Mechanism - Abstract View (cont.)

[Figure: attention as a query/key/value lookup]

SLIDE 10

Attention Mechanism - Abstract View (cont.)

  • For large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
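The paper's remedy is to scale the dot products by 1/√d_k before the softmax. A self-contained numpy sketch of scaled dot-product attention, softmax(Q·Kᵀ/√d_k)·V:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # 1/sqrt(d_k) keeps softmax out of its flat regions
    return softmax(scores) @ V
```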
SLIDE 11

Self Attention

  • AKA intra-attention
  • An attention mechanism relating different positions of a single sequence ⇒ Q, K, V are all derived from the same sequence
  • Check the case when
    – Q_i = W^Q X_i
    – K_1, …, K_n = W^K X_1, …, W^K X_n
    – V_1, …, V_n = W^V X_1, …, W^V X_n
Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
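In code, the three projections above are just three learned matrices applied to the same sequence. A toy sketch (the sizes and random weights are placeholders for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8          # toy sizes, not the paper's
X = rng.normal(size=(n, d_model))   # a single input sequence

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # Q, K, V all come from the SAME sequence
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V   # (n, d_k): every position attends to every position
```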
SLIDE 12

Multi-head attention

  • Allows the model to
    – jointly attend to information
    – from different representation subspaces
    – at different positions
Vaswani et al., “Attention Is All You Need”, 2017 [modified]. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
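A compact sketch of multi-head attention: h independent attentions over lower-dimensional projections, concatenated and projected back. The shapes follow the paper (d_k = d_model / h); the random weights are stand-ins for learned ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h*d_k, d_model)."""
    heads = []
    for i in range(len(W_Q)):  # each head gets its own representation subspace
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_O  # concatenate heads, project back

# toy usage
rng = np.random.default_rng(0)
h, d_model = 8, 64
d_k = d_model // h
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(rng.normal(size=(10, d_model)), W_Q, W_K, W_V, W_O)  # (10, 64)
```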
SLIDE 13

Multi-head Self Attention

Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 14

Bonus: Attention Is All She Needs

Gregory Jantz, “Hungry for Attention: Is Your Cell Phone Use at Dinnertime Hurting Your Kids?”, https://www.huffpost.com/entry/cell-phone-use-at-dinnertime_n_5207272, 2014.
SLIDE 15

Attention Is All You Need

  • Replace LSTMs with a lot of attention!
    – State-of-the-art results
    – Much less computation for training
  • Advantages:
    – Less complex
    – Can be parallelized, so faster
    – Easier to learn distant dependencies
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 16

Transformer’s Behavior

  • Encoding + first decoding step [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 17

Transformer’s Behavior (cont.)

  • Decoding [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 18

Transformer architecture

  • The core of it:
    – Multi-head attention
    – Positional encoding
[Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jakob Uszkoreit, “Transformer: A Novel Neural Network Architecture for Language Understanding”, https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
SLIDE 19

Transformer architecture (cont.)

  • Encoder (one layer is sketched below)
    – Input embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Feed-forward layers with residual links
  • Decoder
    – Output embedding (like word2vec)
    – Positional encoding
    – Multi-head self-attention
    – Multi-head encoder-decoder attention
    – Feed-forward layers with residual links
  • Output
    – Linear + Softmax
Vaswani et al., “Attention Is All You Need”, 2017.
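To make the ordering concrete, here is a minimal structural sketch of one encoder layer in the paper's post-norm arrangement (single-head attention stands in for multi-head, and dropout is omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(X):                  # single-head stand-in for multi-head
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ X

def feed_forward(X, W1, b1, W2, b2):    # position-wise FFN, applied at every token
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def encoder_layer(X, W1, b1, W2, b2):
    X = layer_norm(X + self_attention(X))   # sublayer 1: attention + residual + norm
    return layer_norm(X + feed_forward(X, W1, b1, W2, b2))  # sublayer 2: FFN + residual + norm

# toy usage with random weights
rng = np.random.default_rng(0)
d, d_ff = 16, 64
out = encoder_layer(rng.normal(size=(6, d)),
                    rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                    rng.normal(size=(d_ff, d)), np.zeros(d))   # (6, 16)
```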
SLIDE 20

Transformer architecture (cont.)

  • Output
    – Linear + Softmax
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 21

Transformer architecture (cont.)

  • Encoder and Decoder
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 22

Transformer architecture (cont.)

  • Feed-forward layers
  • Residual links
  • Layer normalization (the paper uses layer norm, not batch norm)
  • Dropout
Vaswani et al., “Attention Is All You Need”, 2017.
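These pieces combine into one wrapper that the paper applies around every sublayer: LayerNorm(x + Dropout(Sublayer(x))). A sketch (training-time inverted dropout; the 0.1 rate matches the paper's base model):

```python
import numpy as np

def dropout(x, rate, rng):
    mask = rng.random(x.shape) >= rate  # randomly zero activations during training
    return x * mask / (1.0 - rate)      # inverted dropout: rescale so expectations match

def sublayer_connection(x, sublayer, rng, rate=0.1, eps=1e-6):
    """LayerNorm(x + Dropout(sublayer(x))): the wrapper used around both the
    attention and feed-forward sublayers."""
    y = x + dropout(sublayer(x), rate, rng)   # residual link
    return (y - y.mean(-1, keepdims=True)) / (y.std(-1, keepdims=True) + eps)

# toy usage: wrap an identity "sublayer"
rng = np.random.default_rng(0)
out = sublayer_connection(rng.normal(size=(4, 8)), lambda z: z, rng)
```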
SLIDE 23

Transformer architecture (cont.)

  • Attention is all it needs
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 24

Transformer architecture (cont.)

  • [Multi-head] attention is all it needs
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 25

Transformer architecture (cont.)

  • Two types of attention is all it needs :D
  • Remember the signature of multi-head attention
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 26

Transformer architecture (cont.)

  • Embeddings
    – Just a lookup table (sketched below)
Vaswani et al., “Attention Is All You Need”, 2017.
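“Just a lookup table” is meant literally: embedding a token is indexing a row of a learned matrix. A toy sketch (the vocabulary size, dimensions, and token ids are made up):

```python
import numpy as np

vocab_size, d_model = 1000, 512
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))  # learned in practice

token_ids = np.array([5, 42, 7])   # ids produced by some tokenizer (made up here)
embedded = E[token_ids]            # (3, 512): a plain row lookup, no matmul
embedded *= np.sqrt(d_model)       # the paper scales embeddings by sqrt(d_model)
```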
SLIDE 27

Transformer architecture (cont.)

  • Positional Encoding
  • It would allow the model to easily learn to attend by relative positions, since for any fixed offset k:
    sin(pos + k) = sin(pos)·cos(k) + cos(pos)·sin(k)
    cos(pos + k) = cos(pos)·cos(k) − sin(pos)·sin(k)
Vaswani et al., “Attention Is All You Need”, 2017. Alexander Rush, “The Annotated Transformer”, http://nlp.seas.harvard.edu/2018/04/03/attention.html (5/20/2019).
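A sketch of the paper's sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(...); by the angle-addition identities above, PE(pos + k) is a linear function of PE(pos):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims get sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims get cosine
    return pe

pe = positional_encoding(50, 512)  # one row per position, added to the embeddings
```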
SLIDE 28

Transformer architecture (cont.)

  • A 2-tier Transformer network
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 29

Transformer’s Behavior

  • Encoding + first decoding step [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 30

Transformer’s Behavior (cont.)

  • Decoding [Link to gif]
Vaswani et al., “Attention Is All You Need”, 2017. Jay Alammar, “The Illustrated Transformer”, http://jalammar.github.io/illustrated-transformer/ (5/20/2019).
SLIDE 31

Complexity

  • Advantages:
    – Less complex
    – Can be parallelized, so faster
    – Easier to learn distant dependencies
Vaswani et al., “Attention Is All You Need”, 2017.
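For reference, these claims trace back to the per-layer comparison in Table 1 of Vaswani et al. (2017), with n = sequence length, d = representation dimension, k = kernel size:

```
Layer type        Complexity per layer   Sequential ops   Max path length
Self-attention    O(n^2 · d)             O(1)             O(1)
Recurrent         O(n · d^2)             O(n)             O(n)
Convolutional     O(k · n · d^2)         O(1)             O(log_k n)
```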
SLIDE 32

Interpretability

  • Attention mechanism in the encoder self-attention, layer 5 of 6
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 33

Interpretability (cont.)

  • Two heads in the encoder self-attention, layer 5 of 6
Vaswani et al., “Attention Is All You Need”, 2017.
SLIDE 34

References

  • Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017.
  • Alammar, Jay. “The Illustrated Transformer.” 27 June 2018, jalammar.github.io/illustrated-transformer/.
  • Zhang, Shiyue. “Attention Is All You Need.” SlidePlayer, 20 June 2017, slideplayer.com/slide/13789541/.
  • Kurbanov, Rauf. “Attention Is All You Need.” JetBrains Research, 27 Jan. 2019, research.jetbrains.org/files/material/5ace635c03259.pdf.
  • Polosukhin, Illia. “Attention Is All You Need.” LinkedIn SlideShare, 26 Sept. 2017, www.slideshare.net/ilblackdragon/attention-is-all-you-need.
  • Rush, Alexander. “The Annotated Transformer.” 3 Apr. 2018, nlp.seas.harvard.edu/2018/04/03/attention.html.
  • Uszkoreit, Jakob. “Transformer: A Novel Neural Network Architecture for Language Understanding.” Google AI Blog, 31 Aug. 2017, ai.googleblog.com/2017/08/transformer-novel-neural-network.html.
SLIDE 35

Q&A

SLIDE 36

Thanks for your attention!

Your Attention = Softmax(You · [Presentation | Anything else]ᵀ) · [Presentation | Anything else]