
SLIDE 1

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas and François Fleuret
ICML, July 2020
https://linear-transformers.com/


SLIDE 3

Transformers are performant

Transformer models have demonstrated impressive performance on:
- NLP (Vaswani et al., 2017; Devlin et al., 2019; Dai et al., 2019; Yang et al., 2019; Radford et al., 2019)
- Neural machine translation
- Question answering
- Textual entailment
- Speech and audio processing (Sperber et al., 2018)
- Autoregressive image generation and general computer vision (Child et al., 2019; Parmar et al., 2019; Carion et al., 2020; Cordonnier et al., 2020)

A. Katharopoulos
Transformers are RNNs 2/17

SLIDE 4

Transformers are hard to scale

Self-attention computation and memory scale as O(N²) with respect to the sequence length N.

[Figure: time (milliseconds) and GPU memory (MB) vs. sequence length (1000 to 4000) for a single self-attention layer on an NVIDIA GTX 1080 Ti.]


SLIDE 7

Our contributions in a nutshell

- A transformer model with linear complexity both for memory and computation during training
- A transformer model with linear computational complexity and constant memory for autoregressive inference
- Unravelling the relation between transformers and RNNs

SLIDE 8

Definition of a transformer



SLIDE 13

Self-Attention

The commonly used attention mechanism is the scaled dot-product attention:

\[ Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V \]
\[ A_l(X) = V' = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{D}}\right) V \]

Quadratic complexity: the score matrix \( QK^T \) is N × N, so time and memory grow quadratically with the sequence length.
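The scaled dot-product attention above can be sketched in a few lines of NumPy (the dimensions here are arbitrary, chosen only for illustration). Note the explicit N × N score matrix, which is exactly what makes the computation quadratic:

```python
import numpy as np

def softmax_attention(X, W_Q, W_K, W_V):
    """A(X) = softmax(Q K^T / sqrt(D)) V for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                  # (N, N): quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # rows sum to 1
    return A @ V

rng = np.random.default_rng(0)
N, d_model, D = 8, 16, 4
X = rng.standard_normal((N, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, D)) for _ in range(3))
out = softmax_attention(X, W_Q, W_K, W_V)
```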

SLIDE 14

Linear Attention

What if we write the self-attention using an arbitrary similarity score?

\[ V'_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)} \]

SLIDE 15

Linear Attention

What if this similarity is a kernel, namely \( \mathrm{sim}(a, b) = \phi(a)^T \phi(b) \)?

\[ V'_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)} = \frac{\sum_{j=1}^{N} \phi(Q_i)^T \phi(K_j)\, V_j}{\sum_{j=1}^{N} \phi(Q_i)^T \phi(K_j)} \]

(Kernelization)

SLIDE 16

Linear Attention

Matrix products are associative, which makes the attention computation O(N) with respect to the sequence length:

\[ V'_i = \frac{\sum_{j=1}^{N} \phi(Q_i)^T \phi(K_j)\, V_j}{\sum_{j=1}^{N} \phi(Q_i)^T \phi(K_j)} = \frac{\phi(Q_i)^T \sum_{j=1}^{N} \phi(K_j)\, V_j^T}{\phi(Q_i)^T \sum_{j=1}^{N} \phi(K_j)} \]

(Associativity property)
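The equality above can be checked numerically. The NumPy sketch below uses φ(x) = elu(x) + 1 (the simple positive feature map used in the paper) and verifies that the O(N) associativity-based form matches the explicit N × N similarity computation:

```python
import numpy as np

def phi(x):
    """phi(x) = elu(x) + 1: a strictly positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N) form: V'_i = phi(Q_i)^T S / (phi(Q_i)^T Z)."""
    phiQ, phiK = phi(Q), phi(K)
    S = phiK.T @ V               # (D, M) = sum_j phi(K_j) V_j^T
    Z = phiK.sum(axis=0)         # (D,)   = sum_j phi(K_j)
    return (phiQ @ S) / (phiQ @ Z)[:, None]

def quadratic_reference(Q, K, V):
    """Same quantity via the explicit N x N similarity matrix."""
    sim = phi(Q) @ phi(K).T
    return (sim @ V) / sim.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((32, 8)) for _ in range(3))
assert np.allclose(linear_attention(Q, K, V), quadratic_reference(Q, K, V))
```

The only difference between the two functions is the order of the matrix products; the linear form never materializes an N × N matrix.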


SLIDE 18

Causal Masking

Causal masking is used to efficiently train autoregressive transformers.

Non-autoregressive:
\[ V'_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)} \]

Autoregressive:
\[ V'_i = \frac{\sum_{j=1}^{i} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{i} \mathrm{sim}(Q_i, K_j)} \]


SLIDE 21

Causal Masking

In the kernelized form, both cases reduce to a dot product of \( \phi(Q_i) \) with accumulated sums.

Non-autoregressive:
\[ V'_i = \frac{\phi(Q_i)^T S}{\phi(Q_i)^T Z}, \qquad S = \sum_{j=1}^{N} \phi(K_j)\, V_j^T, \qquad Z = \sum_{j=1}^{N} \phi(K_j) \]

Autoregressive:
\[ V'_i = \frac{\phi(Q_i)^T S_i}{\phi(Q_i)^T Z_i}, \qquad S_i = \sum_{j=1}^{i} \phi(K_j)\, V_j^T, \qquad Z_i = \sum_{j=1}^{i} \phi(K_j) \]

Naive computation of S_i and Z_i results in quadratic complexity; computing them as cumulative sums keeps training linear in the sequence length.
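The cumulative-sum computation can be sketched as follows (NumPy; dimensions are illustrative). For clarity the demo materializes every partial sum S_i and Z_i at once; a memory-tight implementation would instead accumulate them in place, one step at a time:

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1

def causal_linear_attention(Q, K, V):
    """V'_i = phi(Q_i)^T S_i / (phi(Q_i)^T Z_i), with S_i, Z_i as cumulative sums."""
    phiQ, phiK = phi(Q), phi(K)
    # S_i = sum_{j<=i} phi(K_j) V_j^T, all i at once via cumsum over the time axis
    S = np.cumsum(phiK[:, :, None] * V[:, None, :], axis=0)  # (N, D, M)
    Z = np.cumsum(phiK, axis=0)                              # (N, D)
    num = np.einsum('nd,ndm->nm', phiQ, S)
    den = np.einsum('nd,nd->n', phiQ, Z)
    return num / den[:, None]

rng = np.random.default_rng(2)
Q, K = rng.standard_normal((2, 16, 8))
V = rng.standard_normal((16, 4))
out = causal_linear_attention(Q, K, V)
```

The result matches a lower-triangular-masked version of the quadratic attention row for row.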


SLIDE 25

Transformers are RNNs

Autoregressive transformers can be written as a recurrent function that receives an input x_i, modifies the internal state {s_{i−1}, z_{i−1}}, and predicts an output y_i:

\[ s_i = s_{i-1} + \phi(K_i)\, V_i^T, \qquad z_i = z_{i-1} + \phi(K_i), \qquad V'_i = \frac{\phi(Q_i)^T s_i}{\phi(Q_i)^T z_i} \]

This yields autoregressive inference with linear complexity and constant memory.
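A minimal sketch of this RNN view (NumPy; the rows of Q, K, V stand in for the per-token projections and are illustrative). Each step touches only a fixed-size state {s, z}, so per-token time and memory stay constant no matter how many tokens have been generated:

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1

def rnn_step(s, z, q, k, v):
    """One autoregressive step: update the constant-size state, emit one output."""
    f = phi(k)
    s = s + np.outer(f, v)       # s_i = s_{i-1} + phi(K_i) V_i^T
    z = z + f                    # z_i = z_{i-1} + phi(K_i)
    y = (phi(q) @ s) / (phi(q) @ z)
    return s, z, y

N, D, M = 32, 8, 4
rng = np.random.default_rng(3)
Q, K = rng.standard_normal((2, N, D))
V = rng.standard_normal((N, M))
s, z = np.zeros((D, M)), np.zeros(D)
ys = []
for i in range(N):
    s, z, y = rnn_step(s, z, Q[i], K[i], V[i])
    ys.append(y)
ys = np.stack(ys)   # reproduces causally masked linear attention, row by row
```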


SLIDE 28

Practical implications

- Our theoretical analysis holds for all transformers, even when using infinite-dimensional feature maps
- We need a simple finite-dimensional feature map to speed up computation
- We derive the gradients as cumulative sums, which allows for a significant speed-up
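For the second bullet, the feature map used in the paper is φ(x) = elu(x) + 1. Because it is strictly positive, every similarity φ(q)ᵀφ(k) is positive, so the normalized attention weights form a valid distribution; a small NumPy check:

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = elu(x) + 1: finite-dimensional, cheap, strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(4)
Q, K = rng.standard_normal((2, 6, 4))
sim = elu_feature_map(Q) @ elu_feature_map(K).T
weights = sim / sim.sum(axis=-1, keepdims=True)
assert (sim > 0).all()                           # positive similarities
assert np.allclose(weights.sum(axis=-1), 1.0)    # rows are valid distributions
```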

SLIDE 29

Experimental setup

Baselines:
- Softmax transformer (Vaswani et al., 2017)
- LSH attention from Reformer (Kitaev et al., 2020)

Experiments:
- Artificial benchmark for computational and memory requirements
- Autoregressive image generation on MNIST and CIFAR-10
- Automatic speech recognition on Wall Street Journal


SLIDE 32

Benchmark

[Figure: log-log plots of time (milliseconds) and GPU memory (MB) vs. sequence length (2^10 to 2^16) for softmax, lsh-1, lsh-4, lsh-8, and linear (ours).]


SLIDE 34

Autoregressive image generation

Unconditional samples, in bits per dimension (lower is better):

Model     MNIST (250 epochs)   CIFAR-10 (1 GPU-week)
Ours      0.644                3.40
Softmax   0.621                3.47
LSH-1     0.745                3.39
LSH-4     0.676                3.51


SLIDE 37

Autoregressive image generation

[Figure: decoding throughput in images per second (log scale) for softmax, lsh-1, and ours, on MNIST and CIFAR-10.]


SLIDE 40

Automatic speech recognition

[Figure: error rate relative to softmax (lower is better) and speedup relative to softmax (higher is better) for bi-lstm, lsh-4, and ours.]


SLIDE 44

Summary

- Kernel feature maps and matrix associativity yield an attention with linear complexity
- Computing the key-value matrix as a cumulative sum extends our efficient attention computation to the autoregressive case
- Using the RNN formulation to perform autoregressive inference requires constant memory and is many times faster


SLIDE 48

Thank you for your time!

Check out the code at https://linear-transformers.com/

```python
import torch
from fast_transformers.builders import TransformerEncoderBuilder

linear_bert = TransformerEncoderBuilder.from_kwargs(
    n_layers=12,
    n_heads=12,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=3072,
    attention_type="linear",
).get()

# dummy 4000-long sequence
y = linear_bert(torch.rand(10, 4000, 768))
```

SLIDE 49

References I

- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics.

SLIDE 50

References II

- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
- Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel. Self-attentional acoustic models. In 19th Annual Conference of the International Speech Communication Association (InterSpeech 2018), Hyderabad, India, September 2018.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

SLIDE 51

References III

- Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 68–80. Curran Associates, Inc., 2019.
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
- Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations, 2020.
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.