

SLIDE 1

How Much Self-Attention Do We Need? Trading Attention for Feed-Forward Layers

Kazuki Irie*, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney

Human Language Technology and Pattern Recognition Group RWTH Aachen University, Aachen, Germany

*joining the Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland

ICASSP 2020, Barcelona, Spain Machine Learning for Language Processing I [TU2.L3.2], May 5, 2020

SLIDE 2

Introduction

  • Major trend: large Transformer language models in 2019 and early 2020.

– OpenAI GPT-2 [Radford & Wu+ 19]
– Nvidia Megatron
– Microsoft Turing-NLG

  • Applications to ASR (Interspeech 2019)

– N-best rescoring, Nvidia [Li & Lavrukhin+ 19]
– Lattice rescoring & shallow fusion, RWTH Aachen [Irie & Zeyer+ 19].

  • Large ASR improvements over a well-tuned LSTM language model:

– LibriSpeech SoTA Interspeech 2019 [Lüscher & Beck+ 19]
– TED-LIUM 2 SoTA this conference (Friday) [Zhou & Michel+ 20],
  with lattice rescoring for the hybrid NN/HMM ASR system.

  • In practice: Large memory requirement for search.

For lattice rescoring: more than 100 GB for large lattices on some tasks...

2 of 13 How Much Self-Attention Do We Need? Trading Attention for Feed-Forward Layers — ICASSP 2020, Barcelona, Spain, May 5, 2020

SLIDE 3

Transformer language models in ASR: Large state size

[Figure: Transformer layer — Positional Encoding, Self-Attention, LayerNorm, Feed-forward, LayerNorm]

  • Transformer LM state: key and value vectors.
  • State size: 2 (for key and value) × L (layers) × dkv (key dim.) × n (positions).
  • In principle: to be stored for each search hypothesis.
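As a sanity check on the numbers used throughout this talk, the state-size formula above can be evaluated directly (a minimal sketch; the function name is ours, the dimensions are the ones from the models in this talk):

```python
def state_size_per_position(num_layers: int, d_kv: int) -> int:
    """Scalars cached per position: 2 (key and value) * L * d_kv."""
    return 2 * num_layers * d_kv

# Baseline 32-layer Transformer LM with d_kv = 768:
baseline = state_size_per_position(32, 768)  # 49152 scalars, the "49 K" used later
small = state_size_per_position(8, 768)      # 12288 scalars, the "12 K" used later

print(baseline, small, baseline // small)    # the 8-layer model has a 4x smaller state
```

Multiplying by the number of positions n and the bytes per scalar gives the actual memory per hypothesis.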


SLIDE 4

Increasing model size without increasing state size?

Objective/Motivation

  • Reduce memory requirement in the original Transformer LM for search!
  • From a modeling perspective (quantization etc. can be applied on top of it).
  • Reconsider the original Transformer layer.

Can we efficiently increase the model size w/o increasing the state size?

  • Hyper-parameters in Transformer language model:

– Number of layers: L
– Tied key, value, and query dimension: dkv
– Feed-forward dimension: dff
– Number of attention heads: H

  • State size: 2 × L × dkv × n
  • Only possibility: increase feed-forward dimension dff
  • Can we put parameters into the feed-forward modules more efficiently?
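To see why dff is the only knob that grows the model without growing the state, a rough per-layer parameter count can be sketched as follows (biases, layer norms, and embeddings are ignored; the helper is illustrative, not from the paper):

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Approximate parameters in one standard Transformer layer."""
    attention = 4 * d_model * d_model  # W_q, W_k, W_v and the output projection
    feed_forward = 2 * d_model * d_ff  # up- and down-projection of the FF module
    return attention + feed_forward

d = 768
# Doubling d_ff adds parameters to every layer ...
grown = layer_params(d, 8192) - layer_params(d, 4096)
# ... but the per-position state 2 * L * d_kv does not depend on d_ff at all.
state_per_layer = 2 * d
print(grown, state_per_layer)
```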


SLIDE 5

This work: 2 modifications for small state Transformer

1. F feed-forward layers per Transformer layer.
2. Sharing key and value matrices.

  • (Reduce the number of Transformer layers L.)
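The two modifications can be illustrated with a minimal, framework-free sketch of one such layer (pure Python, single attention head, no layer norm; all names and dimensions are illustrative, not the paper's code):

```python
import math
import random

def matvec(W, x):  # W: list of rows, x: vector
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class SmallStateLayer:
    """One Transformer LM layer with the two modifications from this talk:
    a single shared key/value projection, and F feed-forward modules.
    Layer norm and multi-head splitting are omitted for brevity."""
    def __init__(self, d, d_ff, n_ff, rng):
        self.w_q = rand_matrix(d, d, rng)
        self.w_kv = rand_matrix(d, d, rng)  # shared key/value: halves the cached state
        self.w_o = rand_matrix(d, d, rng)
        self.ffs = [(rand_matrix(d_ff, d, rng), rand_matrix(d, d_ff, rng))
                    for _ in range(n_ff)]

    def __call__(self, xs):
        d = len(xs[0])
        kvs = [matvec(self.w_kv, x) for x in xs]  # one cached vector per position
        ys = []
        for t, x in enumerate(xs):
            q = matvec(self.w_q, x)
            scores = [sum(qi * ki for qi, ki in zip(q, kvs[j])) / math.sqrt(d)
                      for j in range(t + 1)]      # causal attention
            alpha = softmax(scores)
            ctx = [sum(a * kvs[j][i] for j, a in enumerate(alpha)) for i in range(d)]
            h = [xi + oi for xi, oi in zip(x, matvec(self.w_o, ctx))]
            for w1, w2 in self.ffs:               # F feed-forward sublayers
                inner = [max(0.0, v) for v in matvec(w1, h)]
                h = [hi + oi for hi, oi in zip(h, matvec(w2, inner))]
            ys.append(h)
        return ys

rng = random.Random(0)
layer = SmallStateLayer(d=8, d_ff=16, n_ff=3, rng=rng)
out = layer([[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)])
print(len(out), len(out[0]))  # prints: 5 8
```

Because the same projection serves as key and value, only one vector per position per layer has to be cached, and stacking F feed-forward modules adds parameters without adding anything to that cache.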



SLIDE 9

Experimental setups

Dataset: TED-LIUM release 2

  • 152K-word vocabulary.
  • 270 M running words for language model training.
  • 7 subsets including the transcriptions.
  • Minor overlapping problem in the official data (see our paper for details).
  • Some additional experiments on LibriSpeech (to be found in paper).

ASR baseline

  • Dedicated system paper (new SoTA on TED-LIUM 2), Friday at 15:15-17:15.

Session: Large Vocabulary Continuous Speech Recognition and Search.
Zhou et al., The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment.

  • Hybrid NN/HMM system.
  • First pass with 4-gram/LSTM.
  • Lattice rescoring to apply LSTM/Transformer language models.


SLIDE 10

Baseline LM setups: TED-LIUM 2

Basic setups

  • 4-gram: interpolation.
  • LSTM: 4 layers, 2048 nodes in each layer.
  • Transformer: 32 layers, 4096 feed-forward dim., 768 hidden units, 12 heads, no positional encoding.

  Model         #Param. [M]   Perplexity (Dev / Test)
  4-gram        343           105 / 125
  + pruning     161           113 / 128
  LSTM          450           74 / 71
  Transformer   414           62 / 61

All language model configurations/models online: https://github.com/rwth-i6/returnn-experiments/tree/master/2020-lm-small-state-trafo


SLIDE 11

Effect of deep feed-forward module

Perplexity results on TED-LIUM 2

L = number of Transformer layers; F = number of feed-forward layers per Transformer layer

  Transformer         F   L    dff    State size per   #Param.   Perplexity
                                      position [K]     [M]       Dev / Test
  Standard            1   8    4096   12               206       68 / 65
                      1   32   2048   49               313       63 / 62
                      1   32   4096   49               414       62 / 61
  Deep feed-forward   7   6    2048   9                280       65 / 63
                      3   8    4096   12               338       63 / 62
                      3   12   4096   18               379       62 / 61
                      3   16   4096   24               464       61 / 61

Key/value dimension (dkv) is fixed to 768 for all models.

  • (L = 8, F = 3)-model: only 2% rel. worse than the baseline (L = 32, F = 1), with a 4 times smaller state size.
  • Also confirmed on LibriSpeech.


SLIDE 12

Effect of sharing KV

Perplexity results on TED-LIUM 2

  Transformer Layer   Shared-KV   L    F   State size per   #Param.   Perplexity
                                           position [K]     [M]       Dev / Test
  Standard            No          32   1   49               414       62 / 61
                      Yes         32   1   25               395       63 / 61
  Deep feed-forward   No          8    3   12               338       63 / 62
                      Yes         8    3   6                333       66 / 64

  • Up to 5% degradation for the proposed model with a deep feed-forward module.
  • Almost no degradation for the standard model.
  • → Counter-intuitive? More components are affected in the standard case; should there not be a larger effect?
  • → Intuitive? The model with fewer self-attention layers is affected more: the importance of these few layers is greater.


SLIDE 13

Knowledge distillation

  • LSTM requires much less memory in ASR search.
  • Knowledge distillation from Transformer to LSTM.
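A generic distillation objective — interpolating hard-label cross-entropy with cross-entropy against the teacher's soft targets — can be sketched as follows (the function names and the interpolation weight `alpha` are illustrative assumptions, not the exact recipe from the paper):

```python
import math

def cross_entropy(target_dist, log_probs):
    """H(p, q) = -sum_w p(w) * log q(w); target_dist may be one-hot or soft."""
    return -sum(p * lq for p, lq in zip(target_dist, log_probs))

def distillation_loss(student_log_probs, teacher_probs, gold_index, alpha=0.5):
    """Interpolate hard-label cross-entropy with the teacher's soft targets."""
    one_hot = [1.0 if i == gold_index else 0.0
               for i in range(len(student_log_probs))]
    hard = cross_entropy(one_hot, student_log_probs)
    soft = cross_entropy(teacher_probs, student_log_probs)
    return (1 - alpha) * hard + alpha * soft

# Toy 4-word vocabulary: the student puts most mass on the gold word,
# while the teacher spreads some mass over a plausible alternative.
student_log_probs = [math.log(p) for p in [0.7, 0.1, 0.1, 0.1]]
teacher_probs = [0.6, 0.3, 0.05, 0.05]
print(distillation_loss(student_log_probs, teacher_probs, gold_index=0))
```

The soft term pushes the student LSTM toward the Transformer's full next-word distribution rather than only the observed word.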

Perplexity results on TED-LIUM 2

  Model                 State size for    #Param.   Perplexity
                        n positions [K]   [M]       Dev / Test
  Baseline LSTM         16                450       74 / 71
  Teacher Transformer   n × 49            414       62 / 61
  Student LSTM          16                450       66 / 63

  • 10-12% relative improvement over the baseline LSTM.
  • Still behind the Transformer teacher, but with a much smaller memory requirement.
  • Compare also our other paper: Gerstenberger et al., Domain Robust, Fast, and Compact Neural Language Models, in session Language Modeling on Friday.


SLIDE 14

ASR results: TED-LIUM 2

  • Hybrid NN/HMM System (Will be presented this Friday [Zhou & Michel+ 20]).
  • First pass with 4-gram + LSTM.
  • Lattice rescoring (→) with Transformer.

  Model             L    F   Dev          Eval
                             PPL   WER    PPL   WER
  4-gram + LSTM     –    –   64    5.5    69    6.1
  → Transformer     32   1   55    5.1    59    5.6
  → Transformer     8    3   56    5.2    60    5.7

  • Small-state Transformer: similar performance to the standard Transformer.
  • Requires much less memory: 16 GB instead of 64 GB for the largest lattice.


SLIDE 15

Summary

Simple modifications to the Transformer layer:

  • (1) F feed-forward layers per layer → works well.
    We can reduce the total number of layers, and thus the number of self-attention layers.
  • (2) Sharing key and value matrices → extra reduction in state size, with some degradation if combined with (1).

The 1:1 self-attention to feed-forward ratio in the original Transformer is sub-optimal for the state size.

Possible extensions to further reduce the memory requirement for search:

  • Not all layers need self-attention.
    → Lower/mid layers do not require self-attention [Irie & Zeyer+ 19]; replace them with static weighted averaging (with constant state size).
  • Combine this with fixed-memory-size Transformers:
    e.g. Transformer-XL [Dai & Yang+ 19], Compressive Transformer [Rae & Potapenko+ 20].


SLIDE 16

Thank you for your attention. Please send your questions to: irie@cs.rwth-aachen.de

SLIDE 17

References

[Dai & Yang+ 19] Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. Association for Computational Linguistics (ACL), pp. 2978–2988, Florence, Italy, July 2019.

[Irie & Zeyer+ 19] K. Irie, A. Zeyer, R. Schlüter, H. Ney. Language modeling with deep Transformers. In Proc. Interspeech, pp. 3905–3909, Graz, Austria, Sept. 2019.

[Li & Lavrukhin+ 19] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, R. T. Gadde. Jasper: An end-to-end convolutional neural acoustic model. In Proc. Interspeech, pp. 71–75, 2019.

[Lüscher & Beck+ 19] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, H. Ney. RWTH ASR systems for LibriSpeech: Hybrid vs attention. In Proc. Interspeech, pp. 231–235, Graz, Austria, Sept. 2019.

[Radford & Wu+ 19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever. Language models are unsupervised multitask learners. [Online]: https://blog.openai.com/better-language-models/, 2019.

[Rae & Potapenko+ 20] J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, T. P. Lillicrap. Compressive transformers for long-range sequence modelling. In Int. Conf. on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.

[Zhou & Michel+ 20] W. Zhou, W. Michel, K. Irie, M. Kitza, R. Schlüter, H. Ney. The RWTH ASR system for TED-LIUM release 2: Improving hybrid HMM with SpecAugment. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), to appear, Barcelona, Spain, May 2020.